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Abstract 
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Through  an  appeal  to  asymptotic  Gaussian  representations  of  certain  em¬ 
pirical  stochastic  processes,  we  are  able  to  apply  the  technique  of  continuous 
regression  to  derive  parametric  and  nonparametric  functional  estimates  for 
underlying  probability  laws. 

This  asymptotic  regression  approach  yields  estimates  for  a  wide  range 
of  statistical  problems,  including  estimation  based  on  the  empirical  qucintile 
function,  Poisson  process  intensity  estimation,  parametric  and  nonparametric 
density  estimation,  and  estimation  for  inverse  problems. 

Consistency  and  asymptotic  distribution  theory  are  established  for  the 
general  parametric  estimator.  In  the  case  of  nonparametric  estimation,  we 
obtain  rates  of  convergence  for  the  density  estimator  in  various  norms. 

We  demonstrate  the  application  of  this  methodology  to  inverse  problems 
and  compare  the  performance  of  the  asymptotic  regression  estimator  to  other 
estimation  schemes  in  a  simulation  study.  The  asymptotic  regression  esti- 
^  mates  2ire  ecisily  computable  and  are  seen  to  be  competitive  with  other  results 
in  these  areas. 
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1.  Asymptotic  Regression  Estimation 

1.1  Background 

Arguably,  estimation  is  the  essential  problem  of  statistics.  To  set  up  the 
conceptual  framework  for  estimation,  we  assume  that  a  random  quantity 
T  behaves  according  to  a  certain  but  unknown  probability  law  L  and  that 
quantitative  information  about  £  can  be  recovered  through  observation  of  T. 
In  the  English  language,  the  noun  estimate  connotes  a  subjective  judgment 
that  takes  the  place  of  definite  factual  knowledge.  This  is  akin  to  its  use  in 
statistics,  which  we  illustrate  with  the  typical  example  of  an  independent, 
identically  distributed  (i.i.d.)  random  sample. 

The  random  variable  T  has  a  probability  density  function 
given  by  /(•;  r)  where  the  parameter  r  is  unknown.  We  observe  a 
random  sample  (Ti, . . .  ,  Tn)  of  i.i.d.  values  of  T.  A  specific  func¬ 
tion  Tn{Ti,...  ,  r„)  of  the  observation  may  be  called  an  estimator 
of  T.  Given  an  observation,  the  value  of  that  function  is  then 
an  estimate  of  r.  The  estimate  is  considered  to  be  an  acceptable 
substitute  for  the  unknown  true  parameter  value. 

Efforts  to  quantify  estimator  accuracy  and  remove,  reduce,  or  otherwise  con¬ 
trol  the  subjective  element  in  this  process  can  be  traced  back  at  least  two 
hundred  years.  Since  that  time,  numerous  criteria  for  generating  and  select¬ 
ing  estimators  have  been  proposed,  including  the  methods  of  least  squares 
(due  to  K.  F.  Gauss  [20]),  moments  (due  to  K.  Pearson  [55]),  maximum  like¬ 
lihood  (due  to  R.  A.  Fisher  [19]),  minimum  chi-square,  minimum  distcince, 
maximum  product  of  spacings,  minimax,  Bayes,  and  Pitman,  to  name  a  few. 

In  spite  of  this  diversity,  however,  statisticians  seem  to  agree  that  the 
accuracy  of  a  useful  estimator  should  increase  as  the  sample  size  n  increases. 
This  requires  among  other  things  that  r  is  a  function,  which  is 

to  say  that  two  distinct  parameter  values  cannot  correspond  to  the  same 
distribution.  This  condition,  termed  identifiability,  is  a  feature  of  a  class  of 
probability  distributions  and  not  of  any  estimator. 
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Given  an  identifiable  class  of  distributions,  we  can  begin  to  talk  about 
desirable  properties  of  estimators.  First  of  all,  note  that  an  estimator  is 
calculated  from  a  finite  sample  that  is  representative  of  a  larger  finite  or 
possibly  infinite  population.  Fisher’s  [19]  original  characterization  of  the  no¬ 
tion  of  consistency  is  that  an  estimator  calculated  firom  the  entire  population 
should  achieve  the  true  parameter  value.  Of  course,  the  estimator  itself  is 
also  a  random  variable  (r.v.);  and,  formally,  ah  estimator  is  considered  to 
be  consistent  if  it  converges  in  probability  to  the  true  parameter  value  as  the 
sample  size  increases.  That  is, 

(Ve  >  0)  (V(J  >  0)  (3iV)  (Vn  >  N)  Pr(|T„  -  t|  xJ)  <  e, 
which  is  denoted  more  succinctly  by 

r„  r  as  n  — >  oo. 

A  concept  related  to  convergence  in  probability  is  that  an  estimator  may  be 
asymptotically  nnbiased.  That  is, 

Er„  T  as  n  -4  00, 

where  “E”  denotes  the  expectation.  For  an  asymptoticeilly  unbijised  estima¬ 
tor,  the  criterion  that  accuracy  increases  with  sample  size  can  be  formalized 
as 

VarT„-)-0  as  n-^oo, 

where  “Var”  denotes  the  variance.  This  states  that  the  dispersion  of  an 
estimator  should  decrease  as  n  increases.  The  connection  between  accuracy 
and  variance  of  an  estimator  was  recognized  in  the  time  of  Gauss  and  Laplace, 
as  noted  by  Stuart  and  Ord  [72]. 

The  existence  of  the  well-known  Cramer-Rao  lower  bound  for  the  variance 
of  an  estimator  provides  in  certain  cases  a  criterion  for  optimality  of  an 
estimator.  This  occurs,  for  example,  in  maximum-likelihood  estimation  of 
a  probability  density  function  parameter  based  on  a  random  sample.  See 
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C.  R.  Rao  [57]  for  an  exposition.  The  maximum-likelihood  estimator  (MLE) 
achieves  this  bound  asymptotically,  and  is  hence  asymptotically  optimal.  As 
a  special  case,  we  have  the  problem  of  regression  in  the  linear  model  with 
normal  error. 


Y  =  XI5  +  e,  (1.1) 

where  T  is  an  n-vector,  X  is  a  fixed  nxp  matrix,  /3  is  a  p-vector  of  coef¬ 
ficients  to  be  estimated,  and  the  error  vector  e  has  the  multivariate  normal 
distribution  with  e  ~  Ar„(0,  a^I).  In  this  case,  it  is  known  that  the  maximum- 
likelihood,  least-squares,  and  minimum- variance  unbiased  estimators  of  are 
one  and  the  same.  Seber  [60]  points  this  out. 

In  the  present  work,  we  propose  an  estimation  methodology  for  a  class  of 
problems  that  includes  estimation  of  r  based  on  i.i.d.  observations  Ti, . . . , T„ 
from  the  probability  density  /(•;  t).  The  scheme  is  based  on  a  generalization 
of  the  linear  regression  model  of  equation  (1.1). 

1.2  Introduction 

The  estimation  procedure  proposed  in  this  work  is  based  on  the  observation 
of  a  stochastic  process  with  certain  asymptotic  properties.  The  principle  of 
maximum  likelihood  and  the  technique  of  continuous-time  regression  are  ap¬ 
plied  to  an  asymptotic  version  of  the  observed  process  to  yield  estimates  of  an 
underlying  probability  law.  The  resulting  estimators  have  optimal  properties 
similar  to  those  of  maximum-likelihood  and  least-squares  estimators. 

We  can  now  describe  the  modeling  situation  and  estimation  procedure 
that  are  the  basis  of  this  work.  To  that  end,  let  {r„}neN  be  a  sequence  of 
random  variables  with  common  probability  law  £'(r),  where  the  unknown 
true  parameter  value  r  lies  in  some  suitable  parameter  space  ©.  It  is  r  that 
we  wish  to  estimate.  For  each  n,  let  X„(t)  be  a  stochastic  process  with 
sample  paths  in  a  space  8  of  functions  defined  on  a  domain  I.  Suppose  that 
Xn  is  determined  by  (Ti, . . .  ,  T^)  and  that  Xn  is  a  sufficient  statistic  for  £<(t). 
Furthermore,  let  the  sequence  of  stochastic  processes  converge  in 
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distribution  in  the  sense  that 

^/^{Xn-Mr)-^Ar  aS  71-^00,  (1.2) 

where  Mr  is  a  deterministic  function  and  i4r  is  a  zero-mean  Gaussian  stochas¬ 
tic  process  with  covariance  function  E  Ar{s)Ar{t)  =  Kr{s,t). 

For  a  finite-dimensional  (vector)  random  variable,  convergence  in  distri¬ 
bution  is  taken  to  mean  convergence  of  the  cumulative  distribution  at  each 
continuity  point.  The  infinite-dimensional  (function)  situation  is  more  com¬ 
plicated,  and  the  practical  technical  details  are  application-dependent.  Our 
discussion  of  convergence  in  distribution  for  random  functions  is  deferred  to 
section  2.6.1. 

The  following  model,  which  we  call  the  asymptotic  model  for  the  process 
Xn,  plays  a  central  role  in  the  development  of  the  proposed  estimator.  With 
Mr  and  Ar  as  in  (1.2),  consider  a  process  X*  defined  by 

(1.3) 

The  key  feature  of  this  model  is  that  the  mean  and  covariance  functions  of 
the  process  sequence  elements  share  a  common  parameter  r. 

In  the  remainder  of  this  chapter,  we  outline  a  general  estimation  tech¬ 
nique  for  the  unknown  parameter  of  the  asymptotic  model  (1.3)  based  on 
concepts  from  continuous-time  regression  and  maximum-likelihood  estima¬ 
tion  for  Gaussian  stochastic  processes.  The  technique  is  applicable  in  the 
finite-dimensional  case,  where  ©CM**  for  some  finite  positive  integer  d,  and 
also  in  the  nonparametric  setting,  where  ©  is  some  space  of  functions  on  I. 
The  proposed  estimator  for  r  based  on  the  X„  of  (1.2)  is  then  obtained  by 
using  Xn  in  place  of  X*  in  the  estimation  scheme.  (In  what  follows,  we  only 
occasionally  distinguish  from  X*.)  The  properties  of  these  estimators  are 
studied  in  the  remainder  of  this  work. 

In  chapter  2,  we  consider  the  parametric  estimation  problem,  in  which 
the  parameter  space  is  a  subset  of  for  a  finite  positive  integer  d.  We  also 
discuss  the  existence,  consistency,  and  optimality  of  our  estimators.  In  chap¬ 
ter  3,  we  consider  nonparametric  estimation  and  see  that  the  corresponding 
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penalized  estimation  procedure  yields  a  solution  with  desirable  properties 
and  provides  a  unified  approach  to  a  wide  class  of  statistical  problems.  In 
chapter  4,  we  discuss  the  practical  computational  aspects  of  the  proposed 
nonparametric  estimation  scheme. 

Fundamental  examples  of  stochastic  processes  that  satisfy  (1.2)  are  the 
empirical  cumulative  distribution  function  (c.d.f.),  empirical  quantile  func¬ 
tion  (q.f.),  and  Poisson  counting  process,  as  detailed  in  the  following  exam¬ 
ples. 

Example  (Random  Sample).  Let  71,...  ,r„  be  i.i.d.  random  variables 
defined  on  I  with  continuous  cumulative  distribution  function  F  =  and 
probability  density  function  /  =  F'.  The  empirical  c.d.f.  is  defined  as 

t=l 

It  is  well  known  that  F„  is  a  sufl5cient  statistic  for  r  and  that 
y/n  (Fn  —  F)  B  o  F  as  n  00. 

Here,  B  is  a  standard  Brownian  bridge,  which  is  a  zero-mean  Gaussian  pro¬ 
cess  with  covariance  function  s  At  —  st.  See,  for  example,  Billingsley  [5], 
Theorem  16.4.  So  we  identify  F„(t)  with  F{t)  -1-  B(F(t)),  and  then  we 

estimate  r  based  on  modeling  F„  by  equation  (1.3)  with  the  indicated  mean 
and  covariance  functions.  Specifically,  the  model  is 

Fn(f)  =  Fr(t)  -|-  ^Ar(t), 

where  A^.  is  a  zero-mean  Gaussian  process  with  covariance  function  Fr{s  A 
t)-Fr{s)Fr{t). 

Nussbaum  [48]  exhibits  the  following  result,  which  provides  theoretical 
justification  for  the  asymptotic  identification  of  an  empirical  process  with 
a  Gaussian  model  in  the  case  of  density  estimation:  The  two  sequences  of 
statistical  experiments  given  by  observations 

Ti,  i  €  {1, . . .  ,  n},  i.i.d.  with  p.d.f.  /,  and 
dX{t)  =  dt  +  dW{t), 
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where  W  is  a  standard  Wiener  process,  are  asymptotically  equivalent  in  the 
sense  of  Le  Cam’s  deficiency  distance.  Le  Cam  and  Yang  [43]  discuss  the 
asymptotic  characterization  of  statistical  experiments.  Our  aim,  however,  is 
to  exploit  the  general  heuristic  identification  of  (1.2)  with  (1.3)  and  thereby 
obtain  practical  and  useful  estimation  procedures  for  a  variety  of  statistical 
models.  And  so  we  continue. 

For  the  random  sample,  a  second  model  is  based  on  another  sufficient 
statistic,  the  empirical  quantile  function.  The  quantile  function  Q,  defined 
by  Q{u)  =  inf{t :  F{t)  ^  u},  is  the  unique  left-continuous  pseudo-inverse  of 
F.  The  empirical  quantile  function  is  given  by 

Qn{u)  =  inf{<  :  F„(i)  ^  u}, 

and  the  density  quantile  function  isg  =  foQ.  DiflFerentiation  of  F{Q{u))  =  u 
yields  Q'  —  1/g.  In  this  case,  it  is  known  that 

{Qn-Q)-^B  as  n  oo, 

where  B  is  a  standard  Brownian  bridge.  See,  for  example,  Shorack  and 
Wellner  [63],  section  18.1.  So  we  identify  Quit)  with  Qr{t)  + 
and  use  the  model 

Qnit)  =  QT{t)  + 

where  Ar  is  a  zero-mean  Gaussian  process  with  covariance  function 
Q'^{s)Q'^{t){s  At- St). 

Example  (Poisson  Process).  Let  Ti, . . .  ,r„  be  i.i.d.  Poisson  processes  on 
I  =  [0, 1].  Each  Ti  has  a  representation  as  a  sum  of  point  masses 

j=l 

and  with  each  Ti  we  associate  the  counting  process 

hi 

3=1 


6 


The  processes  Nt  have  a  common  compensator,  or  mean-value  function, 
G(t)  =  Grit)  =  EiVi(i).  Its  derivative  g  =  G'  is  called  the  intensity  function. 
A  sufficient  statistic  for  r  is 


”  <=1 


It  is  known  that 


^{Xr,-G)^y/^)‘Wo—  as  n-^oo, 


where  VF  is  a  standard  Wiener  process,  which  is  a  zero-mean  Gaussian  process 
with  covariance  function  EW(s)W(t)  =  s  A  t.  So,  we  identify  X„(t)  with 
G(t)  -I-  W  [G(t)/G(l)]  and  model  the  process  as 

Xr.{t)  =  G(()  +  ^A(t), 


where  A  is  a  zero-mean  Gaussian  process  with  covariance  function  G{s  A  t). 


Other  examples  of  applications  that  may  fit  into  this  scheme  include 
estimation  based  on  the  hazard  function,  random  censoring  models,  marked 
Poisson  processes,  and  deconvolution  and  other  ill-posed  inverse  problems. 

A  special  case  of  this  procedure  is  noted  by  Emanuel  Parzen  [54].  He  con¬ 
siders  problems  of  location  and  scale  estimation  based  on  continuous-time  re¬ 
gression  of  the  empirical  quantile  function.  His  methodology  reproduces  well- 
known  results  about  the  use  of  linear  combinations  of  order  statistics  to  solve 
such  problems.  For  distributions  with  location  and  scale  dependence,  such 
as  the  three-parameter  lognormal  and  Weibull  distributions,  Kindermann 
and  LaRiccia  [32]  propose  an  e£isily  computable  generalization  of  Paxzen’s 
procedure. 

As  stated,  the  estimation  technique  we  propose  is  based  on  ideas  from 
continuous-time  regression.  So,  before  the  estimators  are  defined,  we  give 
a  brief  overview  of  that  subject.  We  also  develop  the  generalizations  that 
enable  us  to  apply  continuous-time  regression  to  the  modeling  of  this  section’s 
examples. 
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1.3  Regression  for  Gaussian  Processes  with  Known 
Covariance 

This  section  contains  the  basic  facts  we  need  about  continuous-time  regres¬ 
sion  for  Gaussian  stochastic  processes  with  known  covariance  functions.  More 
details  are  presented  in  the  Appendix.  The  development  is  adapted  from  the 
work  of  Emanuel  Parzen  [51],  [52],  and  [53]. 

Consider  a  Gaussian  stochastic  process 

X{t)  =  M{t)  ^  A{t) 

defined  for  i  G  7  with  unknown  mean- value  function  EA’(t)  =  M{t)  and 
known  covariance  function  E  A.(s)A(t)  =  K{s,t).  The  purpose  of  continuous¬ 
time  regression  is  estimation  of  the  mean- value  function.  This  is  accom¬ 
plished  by  identifying  an  appropriate  likelihood  ratio  and  then  conducting 
maximum-likelihood  estimation. 

The  reference  measure  for  the  likelihood  ratio  is  derived  from  another 
process  Y{t)^  which  is  a  zero-mean  Gaussian  stochastic  process  with  covari¬ 
ance  function  Ey’(s)y(t)  =  K{s,t),  so  X  and  Y  have  the  same  (known) 
covariance  function.  Denote  by  T(7f,  M)  and  7{K)  the  probability  measures 
induced  by  X{t)  and  Y{t)^  respectively,  on  the  space  of  sample  paths.  The 
likelihood  ratio  itself  is  the  Radon-Nikodym  derivative 

dfP{K,M) 
d?{K)  ’ 

which  is  defined  using  a  space  of  random  variables  7/2  (-X"),  a  function  space 
TTr-,  and  a  map  <f>  between  the  two. 

To  define  L2(X),  first  consider  the  linear  span  of  X(t),  denoted  L{X)  and 
defined  by 

L{X)  =  1 diXiU) :  n  G  N,  tj  G  7,  Oi  G  R 
l»=i 

This  is  the  set  of  all  finite  linear  combinations  of  values  of  X  taken  at  ar- 
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bitrary  points.  An  inner  product  on  L{X)  is  given  by  (u,  v)  =  Euv.  The 
Hilbert  space  L2{X)  is  the  completion  of  L{X)  in  the  corresponding  norm 
||u||  =  V^. 

Denote  by  Hk  the  reproducing  kernel  Hilbert  space  (RKHS)  of  functions 
on  I  with  reproducing  kernel  K,  inner  product  (•,  and  norm  ||  •  Hk,  where, 
of  course,  ||x||k  =  See  Aronszajn  [2]  for  a  discussion  of  reproducing 

kernel  Hilbert  spaces.  The  point  evaluation  functional  representers  in  Hk 
are  denoted  by  Kt,  t€  I,  where  Kt{s)  =  K{s,t).  These  have  the  “reproduc¬ 
ing”  property — namely,  {Kt,  f)K  =  fit)  for  all  t  G  7  and  /  G  Hk,  and  in 
particular  {Kt,  Kt)  =  K{s,  t)  for  all  5  and  t  in  7. 

The  function  <f)K  '  Hk  — ^  L^iX)  is  defined  on  the  generators  of  Hk  by 
<l>KiKt)  —  X{t)  and  by  linear  extension  on  the  whole  space.  The  map  (f>  is  in 
fact  a  congruence,  or  inner-product-preserving  vector  space  isomorphism. 

We  are  now  able  to  define  the  likelihood  ratio. 

Theorem  1.1.  In  the  event  that  M  G  Hk,  the  measures  "y{K,  M)  and  ‘P{K) 
are  equivalent.  Then  their  likelihood  ratio  is  given  by 

m)  =  =  «P  [MM)  -  IIIM&] .  (1.4) 


Proof.  See  Parzen  [51]. 


□ 


This  functional  is  the  basis  for  determining  maximum-likelihood  estimates  of 
the  parameter  M.  Specifically,  one  takes  as  the  estimator  any  value  M  that 
is  a  solution  of  the  optimization  problem 


maximize  L{M)  subject  to  M  e  Hk. 

M 

Thus,  the  estimator  M  satisfies 


dy{K,M) 

dT{K) 


(X)  =  sup 


{dT{K,M) 
\  d?{K) 


{X):Me  Hk 


] 
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which  is  typically  an  ill-posed  problem.  To  avoid  this  difficulty,  one  usually 
specifies  a  set  M  of  candidate  functions  for  M  and  then  attempts  to  solve  a 
problem  equivalent  to 

maximize  L{M)  subject  to  M  e  M.  (1.5) 

M 

In  this  case,  the  estimator  M  satisfies 

It  is  tempting  to  observe  that  in  a  formal  sense  <t>K{Kt)  =  {Kt,  X)j^  = 
X{t),  and  so  <f>K{f)  =  {ftX)j^  for  all  /  €  Hk  as  another  expression  of  the 
reproducing  property  of  the  Kf.  Then  one  would  have 

-2  log  i(Af)  = -2fliK(^)  + IIJ'^IIk 

=  -2(jr,M>*.  +  ||M|£. 

=  (1.6) 

and  the  optimization  problem  (1.5)  would  be  a  true  least-squares  problem. 
However,  the  sample  paths  of  X  do  not  necessarily  lie  in  Hk,  the  construction 
<pK{f)  cannot  be  an  inner  product  of  elements  in  Hk,  and  the  characteriza¬ 
tion  of  (1.6)  is  only  formal.  Some  authors  use  the  inner  product  notation 
for  with  the  caveat  that  it  is  not  really  an  inner  product.  We  reserve 
inner  product  notation  for  inner  products  and  denote  the  congruence  by  (f>. 
Nonetheless,  in  all  of  our  applications,  the  congruence  does  indeed  have  the 
same  form  as  the  inner  product. 

We  now  consider  several  standard  parameter  set  configurations,  or  possi¬ 
bilities  for  the  set  M.  The  first  two,  parametric  and  nonparametric  estima¬ 
tion,  are  the  ones  used  in  the  current  work.  The  other  three  are  included  for 
their  intrinsic  interest  and  to  illustrate  the  connection  between  continuous 
regression  for  Gaussian  processes  and  discrete  finite  least-squares  regression. 
The  standard  parameter  set  configurations  follow. 
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(i)  Nonparametric  Estimation.  M  is  a  subset  of  some  function  space  such 
as  Li.  This  is  the  most  general  case,  in  terms  of  the  restrictions  placed  on 
candidate  functions. 

(a)  Parametric  Estimation.  In  this  case  M  =  {Mr  :  t  €  0}  for  some 
(finite-dimensional)  set  ©  C  R**  and  family  of  parametric  functions  Mr{t). 
The  estimator  usually  exists,  as  long  as  3Vt  is  a  reasonable  parametric 
family.  But  calculation  can  be  troublesome,  and  determining  the  proba¬ 
bilistic  properties  of  the  estimator  in  complete  generality  can  be  practically 
impossible.  This  is  not  so  in  the  following  situation,  which  is  a  special  case 
of  (a). 

(Hi)  Finite-Dimensional  Subspace.  For  a  fixed  positive  finite  integer  k, 
choose  the  functions  ,/*  in  Hk  and  let  M  =  €  R}. 

Then  one  can  show  that  the  estimator  is  given  by 

k 

t=l 

in  which  the  vector  A  =  {Ai, ,  Ak)^  is  a  solution  of  the  normal  equations 
CA  =  B,  where  the  matrix  C  and  vector  B  have  components  Cij  =  {fufj)^ 
and  Bi  =  <f>K{fi),  respectively.  Note  the  similarity  to  linear  regression.  In  this 
case,  Af„  is  a  uniformly  minimum-variance  unbiased  estimator.  See  Parzen’s 
papers  for  the  details. 

Comments  on  the  utility  of  the  following  two  examples  range  from  “illumi¬ 
nating”  in  Parzen  [51],  section  8.36,  to  “of  little  interest”  in  Grenander  [22], 
p.  98. 

(iv)  Finite  Domain.  Consider  X{t)  =  M(t)  -I-  A{t)  with  i  G  {ti, . . . 

Then  X,  M,  and  A  are  finite-dimensional  (column)  vectors,  so  we  can  write 
them  in  terms  of  their  components;  i.e.,  X  =  (Xi, . . .  ,X„)^,  where  Xi  = 
X{ti),  and  likewise  for  M  and  A.  Thus,  the  model  becomes 

X  =  M  + A, 

with  EX  —  M  and  A  ~  iV„(0,  K).  The  vector  A  has  the  multivariate  normal 
distribution  with  variance-covariance  matrix  K  =  E  AA'^,  where  the  compo- 
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nents  of  K  are  Kij  =  E  AiAj.  In  this  case,  Hk  =  K”  with  the  interpretation 
that  fi  =  f{ti)  for  /  =  (/i, . . .  ,  e  Hk.  The  inner  product  in  Hk  is 
given  by  (/,  g)j^  =  The  point  evaluation  functional  representers  in 

Hk  are  the  columns  of  K.  Specifically,  let  Ki  be  the  i***  column  of  K.  Then 
the  requisite  properties  {Ki,  Kj)j^  =  Kij  and  (/,  Ki)j^  =  fi  are  satisfied.  The 
congruence  <f>K  is  given  by  <f>K{f)  =  so  that  (f>K{Hi)  =  Xi,  as  re¬ 

quired.  The  likelihood  ratio  for  X  with  respect  to  a  zero-mean  multivariate 
normal  random  variable  having  the  same  variance-covariance  matrix  is  sim¬ 
ply  the  quotient  of  the  appropriate  multivariate  normal  probability  densities, 

(27r|iir|)-V2exp[-i(X  -  M)‘^K-\X  -  M)] 
(27r|ii£:|)-i/2exp[-iA'2’iir-iX] 

It  is  easily  verified  that  L  =  exp  [<f>K(M)  —  |||M||^].  Thus,  the  continuous- 

regression  setup  reduces  to  the  familiar  maximum-likelihood  formulation  for 
finite  regression,  and  the  estimator  M  is  given  by 

■^  =  argmjn  |1X-M|||.. 

Note  that  this  is  a  true  least-squares  problem.  A  specific  form  for  M  is 
considered  in  the  final  example. 

(v)  Linear  Model.  In  case  (iv),  fix  an  n  x  A:  matrix  Z  and  let 

M  =  : /3  G  1*=}. 

Then  it  is  very  well  known  that  the  estimator 

^  =  arg  min  IIX  -  Z/3III- 
is  any  solution  P  of  the  normal  equations 

Z'^K-^Zp  =  ZF'K-^X. 

In  the  nonsingular  case,  the  unique  solution  is 

P  =  {Z'^K~^Z)-^Zf^K-^X. 
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Of  course,  this  is  the  classical  linear  regression  model. 

In  our  applications,  we  only  consider  the  parameter  set  configurations 
of  cases  (i)  and  (ii).  The  remaining  cases  simply  illustrate  the  connection 
between  continuous  and  discrete  regression  for  Gaussian  processes  and  are 
included  to  illustrate  the  fact  that  continuous  regression  is  a  generalization 
of  the  familiar  discrete  (finite)  situation. 

1.4  Sequences  of  Processes  with  Known  Covariance 

The  basic  model  for  continuous  regression,  described  in  section  1.3,  is 

X{t)  =  M{t)+A{t) 

with  EX(t)  =  M{t)  unknown  and  Ej4(s)j4(t)  =  K(s,t)  known.  Two  gen¬ 
eralizations  of  this  model  are  required  in  order  to  apply  the  methodology  of 
continuous  regression  to  the  models  described  in  section  1.2.  In  this  section, 
we  discuss  the  first  required  generalization;  that  is,  we  identify  the  likeli¬ 
hood  ratio  sequence  for  the  known-covariance  version  of  the  model  sequence 
(1.3).  The  second  generalization,  adaptation  of  the  continuous-regression 
methodology  to  the  case  of  unknown  covariance,  is  discussed  in  section  1.5. 
Finally,  in  section  1.6,  we  combine  the  two  generalizations  and  formulate  the 
estimation  principle  that  is  the  subject  of  this  work. 

Now  we  compute  the  likelihood  ratio  sequence 

d7{K)  ^ 

for  the  sequence  of  models 

Xn{t)  =  M{t)  +  (1.7) 

The  mean- value  function  here  is  EX„(t)  =  M(t),  and  the  (known)  covariance 
functions  Kn  satisfy 

Kn{8,t)  =  Coy[Xnis),X„{t)]  =  iK{s,t). 
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As  in  section  1.3,  ‘^{Kn,  M)  and  y(K)  are  the  probability  measures  induced 
by  Xn  and  Y,  respectively,  on  the  space  of  sample  paths.  Here,  Y{t)  is  a 
zero-mean  Gaussian  process  with  covariance  function  Ey(s)y(<)  =  K{s,t). 

First,  observe  that  nKn  =  K.  It  is  clear  that  Hk  =  The  inner  prod¬ 
uct  of  Hk  is  characterized  by  whereas  the  inner  product 

on  Hk„  is  characterized  by  =  Kn{t,t).  Then  we  have 

\\K,fK  = 

and  so  for  all  G  €  Hk 

Next,  we  consider  the  sequence  of  maps  <f>n  :  Hk„  L2{X„),  which  satisfy 


MKnt)  =  Xr^it) 

for  all  n  G  N,  where  Knt  =  Kn{‘,t).  We  make  the  dependence  of  <t>K  upon 
X  explicit  and  recall  that  <j>K  •  Hk  L2{X)  is  characterized  by  <f>K{Kt)  — 
<f>K{X,Kt)  =  X{t).  The  map  <f>n  :  Hk  ->  L2(Xn)  likewise  satisfies  <f>n{Knt)  = 
<t>n{Xn,Knt)  =  Xn{t),  SO  by  linearity  we  get  <f>„{Xn,Kt)  =  <f)n{Xn,nKnt)  = 
nXnit).  Assuming  the  formal  dependence  of  the  maps  on  the  processes  is 
independent  of  n,  which  is  true  in  all  practical  situations,  we  have 

<l>n{Xn,  G)  =  n^K^Xfi,  G). 


Then  the  likelihood  ratio  sequence  for  (1.7)  is  given  by 

=  exp  [n(f>KiXn,M)  -  f  l|M||y 


d‘P{K,M) 


(Xn) 


[  d?(K) 

Of  course,  a  maximum-likelihood  estimator  M  of  M  satisfies 


(1.8) 
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Therefore,  in  light  of  the  scaling  property  of  equation  (1.8),  we  finally  observe 

A. 

that  a  maximum-likelihood  estimator  M  of  M  also  satisfies 


d7{K,M) 

d?{K) 


{Xn)  -  sup 


(d?{K,M) 
\  d?{K) 


{X„) :  M  €  M 


for  suitable  M.  Thus,  for  the  purposes  of  optimization,  the  dependence  of 
the  likelihood  ratio  on  n  may  be  ignored.  We  simply  use  the  likelihood  ratio 
(1.4). 


1.5  Regression  for  Gaussian  Processes  with  Unknown 
Covariance 


The  results  of  sections  1.3  emd  1.4  pertain  to  Gaussian  processes  with  known 
covariance.  They  are  not  directly  applicable  to  the  modeling  situations  of 
section  1.2.  In  this  section,  we  propose  an  iterative  estimation  scheme  for 
a  certain  class  of  Gaussian  processes  (with  unknown  covariance)  that  does 
include  the  models  of  section  1.2. 

Consider  a  Gaussian  stochastic  process 


X{t)  =  M{t)+A{t) 

with  unknown  mean  EX(t)  =  M(t)  and  covariance  Ei4(s)j4(t)  =  KM{s,t). 
Note  that  the  covariance  depends  on  the  unknown  mean.  Equivalently,  the 
mean  and  covariance  functions  share  a  common  unknown  parameter  that  we 
wish  to  estimate,  as  in  the  models  of  section  1.2.  We  assume  that  the  true 
mean-value  function  M  lies  in  some  fixed  set  M  of  candidate  mean-value 
functions. 

We  construct  a  recursive  sequence  (Mq,  Mi,  M2, . . . )  of  estimators  for  M 
as  follows.  Select  an  arbitrary  Mq  €  M,  and  for  i  ^  1,  let  Mj  e  JVC  be  such 
that 

d?{KM,. 

In  words,  we  first  assume  an  “initial  guess”  parameter  value  Mq.  Then  we  re¬ 
peatedly  calculate  the  covariance  function  using  the  current  parameter  value 


Mi) 


'}■ 
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and  estimate  a  new  parameter  value  by  means  of  the  principle  of  maximum 
likelihood  for  Gaussian  processes  with  known  covariance  described  in  sec¬ 
tion  1.3. 


1.6  The  Asymptotic  Regression  Estimation  Principle 

We  can  now  formulate  the  general  principle  for  estimation  of  the  unknown 
parameter  of  section  1.2. 

Apply  the  recursive  estimation  scheme  of  section  1.5  to  the 
probability  model  sequence  given  by  equation  (1.3),  taking  into 
account  the  scaling  property  of  section  1.4. 


Explicitly,  the  asymptotic  model  sequence  is 


=  M(t)  + 

where  A{t)  is  a  zero-mean  Gaussian  process  with  covariance  function 
EA{s)A(t)  =  KM{s,t).  The  covariance  function  depends  on  the  unknown 
mean-value  function.  Equivalently,  the  mean  and  covariemce  functions  share 
a  common  parameter.  Our  aim  is  to  estimate  this  unknown  parameter. 

We  assume  that  3Vt  is  a  fixed  set  of  candidate  mean-value  functions  and 
that  the  process  sequence  {Xn}neN  is  given.  The  definition  follows. 


Definition  (Asymptotic  Regression  Estimator).  For  each  n,  construct 
the  recursive  sequence  Afn.ij  Af„,2, . . . )  of  estimators  for  M  in  this  man¬ 

ner:  Select  an  arbitrary  M„,o  G  3Vt,  and  for  i  ^  1,  let  G  M  be  such  that 


,  Mn,t) 


(X„)  =  sup 


(Xn)  :MeM 


We  use  the  terms  asymptotic  regression  estimator,  AR  estimator,  and  ARE 
to  refer  to  any  element  of  a  sequence  so  obtained. 


We  intend  to  show  that  for  fixed  arbitrary  i  ^  1,  the  AR  estimator 
has  good  properties  as  n  oo.  Carroll  and  Ruppert  [7]  have  proposed  a 
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similar  iterative  estimation  scheme  for  the  problem  of  nonlinear  regression 
with  heteroscedastic  error.  We  hope  to  show  for  our  problem,  as  they  have 
done  for  theirs,  that  stopping  the  procedure  after  a  small  number  of  steps 
results  in  an  estimator  with  reasonable  small-sample  properties. 

In  chapter  2,  we  consider  the  parametric  estimation  problem,  in  which 
the  parameter  space  0  is  a  subset  of  R'*  for  some  finite  positive  integer  d. 
We  discuss  the  existence,  consistency,  and  large-sample  distributions  of  AR 
estimators.  Results  in  this  chapter  establish  the  asymptotic  optimality  of 
AR  estimators. 

In  chapter  3,  we  consider  nonparametric  AR  estimation.  We  see  that 
the  solution  of  the  corresponding  penalized  problem  is  an  estimator  with 
desirable  properties.  Also,  we  see  that  AR  estimation  provides  a  unified 
approach  to  a  wide  class  of  statistical  problems. 

In  chapter  4,  we  discuss  the  practical  computational  aspects  of  nonpara¬ 
metric  AR  estimation.  Topics  here  include  discretization,  software  implemen¬ 
tation,  and  a  reliable  dita-driven  method  for  smoothing  parameter  selection. 
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2 


.  Parametric  AR  Estimation 

2.1  General  Parametric  Estimation 


In  this  chapter,  we  consider  asymptotic  regression  estimation  in  the  case 
of  a  finite-dimensional  real  parameter  space.  In  this  section,  we  define  the 
parametric  AR  estimator  and  its  associated  optimization  problem.  In  sec¬ 
tion  2.2,  we  present  and  discuss  the  main  results  concerning  the  consistency 
and  asymptotic  properties  of  the  AR  estimator.  Sections  2.3,  2.4,  and  2.5 
describe  the  application  of  AR  estimation  to  some  standard  statistical  mod¬ 
eling  situations.  Section  2.6  contains  technical  material  including  the  proofs 
of  the  theorems  in  section  2.2. 

To  cast  the  AR  model  and  estimation  procedure  of  section  1.2  into  the 
parametric  setting,  we  suppose,  as  always,  that  is  a  sequence  of  stochastic 
processes  with  sample  paths  in  a  space  S  of  functions  defined  on  a  domain 
I.  In  this  chapter,  we  denote  the  true  parameter  value  by  r  and  assume  that 
r  €  0  C  M**  where  d  is  a  positive  integer. 

Now  let  i4  =  A,,  be  a  Gaussian  process  with  mean  value  EA  =  0  and 
continuous  positive-definite  covariance  function  EA(s)A(t)  =  ii£r^(s,t),  and 
let  Mr  be  a  deterministic  function  on  J.  Next,  define  An,r  =  y/n{Xn  -  Mr) 
and  suppose  that 


A 


n,T 


as  n  — >  oo. 


(2.1) 


In  order  to  define  the  AR  estimator,  we  need  to  identify  certain  spaces,  op¬ 
erators,  and  norms.  So  for  any  9  eQ,  let  Hg  be  the  RKHS  with  reproducing 
kernel  K0,  inner  product  (•,  •)g,  norm  ||  •  jj^,  and  point  evaluation  functional 
representers  Kgt-  We  require  that,  while  the  norms  may  depend  on  the 
underlying  spaces  remain  constant.  That  is  to  say.  He  =  Hr  =  H  for  all 
0  6  0.  Furthermore,  suppose  that  Mb  e  H  for  each  0  6  0.  Let  the  bilinear 
functional  <I)b  :Bx  H  -^R  satisfy 

<i>e{Z,KBt)=^Z{t) 


19 


for  each  tel.  Since  g)  =  (/,  g)^  if  /  €  He,  this  map  is  given  by  the 
inner  product  when  both  of  its  arguments  lie  in  Hg. 

Let  7  be  a  fixed  element  of  ©.  The  probability  density  functional  for  a 
Gaussian  stochastic  process  Z  with  mean  Mg  and  covariance  with  respect 
to  a  mean-zero  Gaussian  process  having  the  same  covariance  is 

=  exp  y>,(Z,  M,)  -  l||M,||g  .  (2.2) 

With  n  €  N  and  Xn  both  fixed,  we  tahe  as  an  estimator  of  r  any  r„  €  0 
satisfying 


<ia>(7) 


(Jf„)  =  sup  I 


cO’iy) 


{X„) :  9  6  e 


We  now  define  the  AR  estimator  sequence. 


Definition  (Parametric  AR  Estimator).  For  fixed  n,  Xn,  and  r„,o  €  ©, 
the  recursive  sequence  of  AR  estimators  (rn,0)  Tn,i,  Tn,2, . . . )  is  defined  for  all 

We  occasionally  write  this  as  =  iS'n(Tn,j-i)  for  notational  convenience. 

Maximization  of  the  likelihood  ratio  (2.2)  is  equivalent  to  minimization 
of 

=  -logS^(X„)  =  +  l||Af,||“, 

which  can  be  rewritten  as 

Jn,y{^)  —  ~<f>y  -t- 

=  i||M,||2  -  {Mr,Mg)^  -  ■^UAn,r,Mg) 

=  HIM,  -  MHI?  -  i||M,||?  -  ^<f>,{An,r,Mg). 

It  is  convenient  to  define 

=  2  11-^®  “  ~ 
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Of  course,  the  ARE  can  also  be  characterized  as  the  minimizer  9  of  the 
functional  We  refer  to  Ln,^{9)  as  the  AR  objective  functional 

2.2  Properties  of  Parametric  Estimators 

In  this  section,  we  present  and  discuss  the  main  results  on  the  properties 
of  parametric  AR  estimators.  Technical  details  and  proofs  are  deferred  to 
section  2,6.  Under  moderate  assumptions,  Theorem  2.1  establishes  the  con¬ 
sistency  of  the  first-stage  {i  =  1)  estimator  in  the  ARE  sequence.  This 
estimator  is  computed  using  an  arbitrary  guess  for  the  covariance  parame¬ 
ter.  With  stronger  assumptions.  Theorem  2.2  establishes  consistency  in  the 
regular  case.  In  this  context,  regular  means  that  an  estimator  is  obtained  as 
a  zero  of  the  derivative  of  an  objective  functional. 

Theorem  2.1.  Consider  the  model  of  section  2.1.  Suppose  r  is  the  true  pa¬ 
rameter  value.  Fix  j  €  0.  Assume  that  the  following  conditions  are  satisfied. 

(1)  0  is  compact. 

(2)  The  map  6  Mq  is  continuous,  so  that  Ln,-f{9)  is  continuous  on  9  €  Q 
for  all  n. 

(3)  For  any  S  >  0,  inf{||A!ffl  -  Mr||^  :  \9  -  t\'^8}>  0. 

(4)  sup{|(^^(A„,r>  Afe)|  :  ^  €  ©}  =  Op(n^/^)  as  n-^  oo,  so  that 

sup{|Ln,7(0)  -  EX„,.y(0)| :  0  €  0} 0  as  n-^oo. 

Then  any  sequence  of  AR  estimators  {r„}„gN  given  by 

Ln,'rM  =  inf{L„,.^(0)  :  0  e  0} 


has  the  property 


as  n  — )•  oo. 


21 


In  what  follows,  the  dot  denotes  differentiation  with  respect  to  the  pa¬ 
rameter.  With  stronger  smoothness  conditions,  we  get: 

Theorem  2.2.  Consider  the  model  of  section  2.1.  Suppose  r  is  the  true  pa¬ 
rameter  value.  Fix'y  €  0.  Assume  that  the  follotving  conditions  are  satisfied. 

(1)  0  is  compact,  and  t  is  in  the  interior  of  0. 

(2)  Me  is  6 -differentiable  and  Me  €  H,  so  that  Ln-f{B)  is  6 -differentiable 
on  0  for  all  n.  \ 

(3)  For  any  5  >  0,  inf{||Mtf  -  Mr||^  :  —  t|  ^  5}  >  0. 

(4)  sup{|(^.y(i4„,r»  ^«)|  :  ^  e  0}  =  asn-^  oo,  so  that 

sup{|L„,.y(0)  —  ^  G  0}  0  as  n  — ^  oo. 

Then,  there  is  a  sequence  of  AR  estimators  {r„}neN  satisfying  both  Ln-^{Trf)  = 
0  and 

T„  r  as  n  oo. 

We  now  state  two  theorems  on  the  as3rmptotic  distributions  of  AR  esti¬ 
mators.  We  consider  the  case  of  a  sufficiently  differentiable  objective  with 
a  unique  minimum.  Of  course,  these  conditions  can  be  weakened  in  many 
ways.  We  restrict  our  attention  to  the  univariate  parameter  case,  with  d  =  1. 
The  arguments  can  easily  be  generalized  to  the  vector  case.  Theorem  2.3  es¬ 
tablishes  the  asymptotic  normality  of  the  first-stage  estimator.  Theorem  2.4 
establishes  the  asymptotic  optimality  of  AR  estimators  for  i  >  1.  The  oper¬ 
ator  Sabi  which  appears  in  the  statements  of  these  two  theorems,  is  defined 
in  Lemma  2.8  of  section  2.6.1. 

Theorem  2.3.  Consider  the  model  of  section  2.1.  Suppose  r  is  the  true  pa¬ 
rameter  value.  Fixy  €  0.  Assume  that  the  following  conditions  are  satisfied. 

(1)  Q  is  compact,  and  t  is  in  the  interior  of  Q. 
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(2)  Me  and  L„,y{6)  are  twice  9 -differentiable  on  0;  and  Me,  Me,  and  Me 
are  in  for  all  9  £ 

(3)  The  maps  9  Me,  9  i->  Me,  o,nd  9  Me,  are  continuous;  i.e., 
\\Ma  -  MpW^  +  \\Ma  -  MplU  +  ||M«  -  M0\U  -^0if\a-l3\^0. 

(4)  Lz^iO)  =  K}. 

(5)  sup{|(^-y(A^,  Me  -  Mr)\  :  -  r|  ^  e}  =  Op(l)  as  £  ->•  0. 

(6)  For  sufficiently  small  e  >  0,  as  n oo, 

9e{An,r)  =  SUp  ^\<j)^{An,r,  Me  -  Mr)| :  -  r|  ^  ej  =  Op(n^l^). 


Then  the  AR  estimator  sequence  {Tn}n€N  has  the  property 
Vn-{Tn-T)-^Yr^N 


if, 

1 0>  ..  .  I  as  n 

V  } 


00. 


One  consequence  of  the  preceding  theorem  is  that 

r„  r  as  n  — >  oo, 


so  that  the  first-stage  AR  estimator  is  weakly  consistent.  Also,  in  the  event 
that  7  =  T,  it  is  easily  shown  that 

,  -  Vax  1  1 

"  ||JW,||J  ||M,||?  Var^.(JW,)' 

This  is  the  Cramer-Rao  lower  bound,  assuming  its  existence.  In  this  Ccise,  Tn 
is  a  best  asymptotically  normal  estimator  of  r. 

With  some  additional  assumptions  on  the  uniform  behavior  of  the  para¬ 
metric  function  family  and  the  stochastic  process,  trivial  modification  of  The¬ 
orem  2.3  yields  the  following  result  on  the  asymptotic  consistency,  normality, 
and  optimality  of  the  AR  estimators  for  i  >  1. 


Theorem  2.4.  Consider  the  model  of  section  2.1.  Let  N{t)  C©  be  a  neigh¬ 
borhood  of  T.  Suppose  that  the  following  conditions  are  satisfied. 
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(1)  For  fixed  X  £$  and  F  e  H,  the  map  7  i-4  <f>^{X,F)  is  continuous  on 
N{r). 

(2)  For  fixed  F  and  G  E  H,  the  map  7  {F,G)^  is  continuous  on  N(t). 

(3)  The  maps  6  Me,  6  i->  Me,  and  6  Me  are  continuous  for  9  €  Q 
uniformly  for  7  €  N{t);  i.e.,  if  \a  —  0\—^O  then 

sup{\\Za-Zp\\,:'reN{r)}^0 

for  each  Z  €  {M,  M,  M}. 

(4)  For  sufficiently  small  e  >  0,  as  n-^  00, 

ge{An,r)=  SUp  SUp  -  Mr)| :  “  r|  ^  =  Op(n^/^) . 

7€N(r)  ^  ^ 

Then,  for  any  fixed  i  >  1,  the  AR  estimator  sequence  {Tn,»}n6N  has  the 
property 

y/n  (Tn,i  —  t)  N  (0,  /(r)-^)  as  n-^  00, 

where  I{t)  is  Fisher’s  Information  Measure. 

This  implies  that  stopping  the  procedure  at  any  i  ^  2  gives  a  best  as3rmp- 
totically  normal  estimator  based  on  X„.  Recall  that  the  likelihood  ratio  is 

A(Z)  =  =  “P  -  HIM, II?] , 

and  that  information  is  defined  using  the  quantity 

^logA(Z)  =  n  (f>.y{Z,Me)-{Me,Me)^ 
as  follows.  Let  r  be  the  true  parameter  value.  Then 

E  ^  log  A{Xn)  =  n  Mr)  -  n  (Mr,  Mr^^  =  0 
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since  <l>'y{Z,  F)  =  (E^  Z,  F)^.  The  information  in  about  r  is  defined  as 


7(r)  =  E 


=  Var 


^logA(X„) 


See,  for  example,  Kutoyants  [36],  p.  10.  So  here  we  have 

I{r)  =  Var  <f>^{Xn,Mr)  =  n  Var  Mr) 

~  nVar  <l>^{Ar,Mr)  =  71115.^^^x11?, 


in  the  sense  that  /  ~  p  if  and  only  if//^— >lasn— >  oo.  Thus,  the  lower 
bound  for  the  asymptotic  variance  of  an  unbiased  estimator  of  r  based  on 

Xfi  is  ^ 

which  we  may  compare  to  the  asymptotic  variance 


n\\Mr\\i, 


of  a  first-iteration  AR  estimator.  In  the  case  that  7  =  r,  we  simply  have 


which  is  indeed  the  asymptotic  variance  of  the  AR  estimators  for  all 
i  >  1,  and  of  Tn,i  when  Tn,o(=  7)  =  t. 

In  the  remainder  of  this  chapter,  we  consider  some  specific  applications 
and  develop  results  concerning  the  limiting  {i  00)  behavior  of  the  AR 
estimator  sequence  for  finite  samples. 


2.3  Application  to  Density  Estimation 

Here  we  develop  the  form  of  the  AR  estimator  specific  to  probability  density 
function  (p.d.f.)  estimation.  The  prime  denotes  differentiation  with  respect 
to  t  6l. 
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Let  ^1, . . .  ,  be  i.i.d.  real  r.v.’s  on  /  C  R  with  continuous  c.d.f.  F  = 
Fr  where  r  €  ©  C  for  some  feasible  parameter  set  0.  The  empirical 
cumulative  distribution  function 

has  mean- value  function 

Ef„(t)  =  F(t) 

and  covariance  function 


CovlF„{s),  E.W]  =  i  [f  («  At)  -  i?(s)E(t)) . 


Suppose  7  G  0.  Let  X{t)  =  Xi{t)  =  F„{t)  —  F^{t),  and  let  Xo(t)  be  a  zero- 
mean  Gaussian  process  with  covariance  function  Ky{s,t)  =  At)  - 

F.y(s)F^(t)].  We  model  X{t)  as  a  Gaussian  process  with  mean  F®(t)  —  F^{t) 
and  covariance  K^. 

Let  denote  the  reproducing  kernel  Hilbert  space  with  reproducing 
kernel  K^.  With  the  map  <f>^  :  L2{X)  given  by  <l>y{K^t)  =  X{t),  the 

required  derivative  is 


m  =  =  exp  [UFt  -  F-,)  -  -  F,ll?] , 

as  long  zs  Fe  —  Fy  £  Hy.  We  use  Fo  —  Fy  in  the  likelihood  ratio  because  Hy 
consists  of  functions  A  on  [0, 1]  with  ^4(0)  =  A(l)  =  0. 

Because  Ky{s,t)  =  u{s  A  t)  v{s  V  t),  where  u  =  Fy  and  u  =  1  —  Fy,  the 
RKHS  inner  product  for  an  empirical  c.d.f.  (Brownian  bridge)  covariance 
structure  is  given  by  equation  (A.2)  as 


as  long  as  \imZ{xY  Fy{x)~^  =  0  and  lim  Z{x)^  [1  -  Fy(a:)]“^  =  0.  Then  the 
form  of  the  density  functional  is 


1  t'  [(F,  -  f,)f 

ih  F' 


(2.3) 
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Some  algebra  yields 


In  terms  of  the  densities  F0  =  fe,  the  AR  estimator  is  that  value  of  6  which 
minimizes 

=  +  p.4) 

The  asymptotic  variance  of  this  AR  estimator  is 

Now  we  characterize  the  limiting  optimization  problem  associated  with 
this  AR  estimator  sequence  in  the  regular  case. 

Theorem  2.5.  For  arbitrary  fixed  n,  let  ^  the  sequence  of  AR  esti¬ 

mators  for  the  parameter  r  of  a  random  sample  density.  Suppose  the  ML  and 
AR  estimation  problems  are  regular.  If  the  AR  estimator  sequence  converges, 
so  that 

6i-^  6  as  i—¥  oo, 

then  6  is  also  the  maximum-likelihood  estimator  of  t. 

Theorem  2.5  states  that  for  density  estimation  based  on  JF},,  the  limiting 
AR  estimator  is  the  maximum-likelihood  estimator. 

2.3.1  Type  I  Censoring 

A  data  set  is  said  to  be  censored  if  a  known  number  of  observations  are 
missing  from  one  or  both  ends  of  the  data  range.  See  David  [12].  In  the 
case  of  Type  I  censoring,  we  observe  the  data  when  they  are  inside  a  known 
range,  say  [a,  6].  If  a  datum  falls  outside  the  range,  we  know  which  side  it  is 
on  but  not  what  its  value  is.  So  the  number  of  observed  data  values  is  itself 
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a  random  variable.  Formally,  we  have  an  underlying  sample  Vi, . . .  ,1^  from 
a  distribution  with  c.d.f.  and  p.d.f.  fs  =  F'g.  However,  we  do  not  observe 
the  Vi  directly.  Rather,  we  observe  the  indicators 

—1,  Yi  <a 
0,  a^Yi^b 
1,  Yi>b 

along  with  the  data 


Yi,  Si  =  0 

not  available,  6i  ^  0. 


The  corresponding  AR  objective  functional  is  based  on  equation  (A.2),  as  is 
the  uncensored  objective  given  by  equation  (2.4).  The  result  is 

1 1  r  /•*  (h?  .  I  1 

2  [y,  /,  F,(o)  1  -  F,{6).  ■ 

A  similar  construction  also  works  for  general  Type  I  censoring,  in  which 
the  observable  data  lie  in  a  union  of  disjoint  intervals.  In  this  case,  data  are 
observed  in 

[ci, 6i]  U  [02, 62]  U  •  •  •  U  [ojfe, 6*], 

where  oi  <  61  <  02  <  •  •  •  <  b^-i  <ak  <bk,  and  data  counts  are  available  for 
each  of  the  complementary  regions  (—00,  oi),  (61, 02), . . . ,  {bk-i,  o*),  (6*,,  00). 

Note  that  censoring  is  different  from  truncation,  in  which  the  distribution 
itself  is  modified  and  the  amount  of  lost  data  is  unknown.  Any  c.d.f.,  say  Fg, 
can  be  truncated  to  the  interval  [0,6].  The  resulting  c.d.f.  Gg  is  given  by 


Gg(x) 


X  <a 
a  ^x  ^b 
x>b. 
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2.3.2  Example:  Linear  Density 

Let  Xx, ...  ,  X„  be  i.i.d.  on  [0, 1]  with  density  function  fb{x)  =b  +  2(1  —  b)x, 
where  b  e  [0, 2].  For  fixed  fa,  the  density  functional  is 


.(&)  =  -/  I 

Jo  CL 


^  6  +  2(1  —  h)t 

+  2(1  —  CL^t 


1  [b  + 2(1 -b)tf 


+  2(1  —  a)t 


dt, 


which  is  quadratic  in  b.  Therefore,  the  AR  estimator,  which  is  the  solution 
of 

minimize  Jo (&)  subject  to  0  ^  6  ^  2, 

b 

can  be  obtained  by  “clamping”  the  minimizer  of  the  unconstrained  objective 
into  the  feasible  region  [0, 2].  So  if  b  satisfies  Ja(b)  =  0,  then  the  estimator  is 

6  =  2  A  (0  V  S)  =  min  ^2,  max(0,  S)  j  . 

Differentiating  with  respect  to  6,  we  obtain 

j'(b)=-  r  ij-mi  r ,, 

Jo  a  +  2(l-a)t'^^"^*^^Jo  a  +  2(1 -a)t 

The  solution  of  Jo(h)  =  0  is 

1  11^  1  -  2Xi 

n  Z—/  . 


6  = 


h(o)  [w  a  +  2(1  —  a)Xi 


where 


and 


T  fn\  -  2t(l  -  2t)  _  o[2  -  2a  -  log(2  -  o)  +  log  o] 

^  Jo  a  +  2(1 -a)t^*  2(1 -ay 
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The  unconstrained  minimizer  is 


£  ,  _ (g  ~  1)^ _ ^  l-2Xi 

n[l-a-ilog(f-l)]|^a  +  2(l-a)X,’ 
which  is  in  fact  continuous  at  a  =  1  with 

n  W 

lim S  =  1  +  - y (1  -  2Xi)  =4- ex. 

o-M  n  ^ 

On  the  other  hand,  the  MLE  is  the  solution  b„  of 

n 

maximize  nif  +  2(1  —  6)Xi]  subject  to  0  ^  6  ^  2,  (2.6) 

*  *=i 

which  does  not  exist  in  closed  form.  However,  Theorem  2.5  implies  that  the 
solution  of  (2.6)  may  be  obtained  as  the  iterated  solution  of  (2.5). 

We  illustrate  these  computations  with  a  small  Monte-Carlo  simulation. 
For  each  sample  size  of  n  =  10,  100,  and  1000,  we  generated  1000  data  sets 
with  a  “true”  parameter  value  of  6  =  0.333  from  the  p.d.f.  f{x)  =  b  +  2(1  - 
b)x.  The  ML  estimators,  fen,  were  computed  using  a  constrained  nonlinear 
minimization  routine  in  S-PLUS,  operating  on  the  negative  log-likelihood. 
The  AR  estimators,  were  computed  with  a  typical  initial  guess  of  bn,o  = 
1.0,  but  occasional  initial  guesses  of  0.5  or  0.1  were  required.  This  happened 
more  often  with  the  smaller  data  sets  (n  =  10)  and  hardly  at  all  with  the 
large  data  sets  (n  =  1000).  Mean-squared  errors  are  presented  in  Table 
2.1.  We  make  several  comments.  For  all  sample  sizes  in  this  simulation,  the 
second-stage  ARE  has  lower  error  than  the  corresponding  MLE.  Theorem  2.5 
states  that  b„,i  ->  6„  as  i  — )•  oo  if  the  sequence  converges.  This  is  true  only  in 
the  regular  cases,  where  estimators  are  obtained  by  zeroing  the  derivatives 
of  objective  functions.  In  this  example,  the  parameter  space  is  constrained, 
and  sometimes  a  solution  is  a  boundary  point  instead  of  a  stationary  point. 
Note  that  even  for  small  samples,  the  second-stage  ARE  is  competitive  with 
the  MLE. 
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Table  2.1.  Mean-Squared  Error  for  AR  and  ML  Density  Estimation  Simu¬ 
lation 


estimator 

n=10 

n=100 

n=1000 

bn 

0.1994 

0.02006 

0.002187 

0.1744 

0.01981 

0.002590 

bn, 2 

0.1851 

0.01899 

0.002177 

bn,Z 

0.1934 

0.01997 

0.002187 

bn,4 

0.1929 

0.01985 

0.002187 

bn,h 

0.1951 

0.02003 

0.002187 

bn, 6 

0.1954 

0.02000 

0.002187 

bn,7 

0.1965 

0.02005 

0.002187 

bn,8 

0.1966 

0.02003 

0.002187 

bn,9 

0.1972 

0.02006 

0.002187 

bn,10 

0.1972 

0.02004 

0.002187 

2.4  Application  to  Quantile  Function  Estimation 

In  this  section,  we  develop  the  AR  estimator  for  the  parameter  of  a  random 
sample  probability  law  based  on  the  quantile  function. 

Let  ^1, . . .  ,Zn  be  i.i.d.  real  r.v.’s  on  J  =  [0, 1]  with  positive  density  /, 
(continuous)  c.d.f. 

F{t)  =  Pr[Z  ^  t], 

quantile  function 


Q{u)  =  F~^{u)  =  inf{t :  F{t)  ^  u}, 
and  density  quantile  function 


g  =  foQ. 
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Differentiating  F{Q{u))  =  u  yields  Q'{u)  F'{Q{u))  =  1,  so 


Relevant  empirical  functions  are  the  empirical  distribution  F„(t),  the  empir¬ 
ical  quantile  function 

Q„{u)  =  F-'W, 

and  the  standardized  quantile  process 

Vniu)  =  y/ng{u)[Qniu)  -  Q(u)]  =  Vn-^^[Qn{u)  -  (?(u)]. 

Under  reasonably  mild  conditions,  the  asymptotic  distribution  of  Vn  is  Gaus¬ 
sian  with  mean  zero  and  covariance  uAv  —  uv.  So  the  asymptotic  charac¬ 
teristics  of  Qn  are 

EQ^{u)  =  Q{u)  and 

K{s,t)  =  Cov[Q;(5),  (5;(t)]  =  ^Q'{s)Q'{t){s  A  t  -  St), 
and  the  asymptotic  model  is 

= <3W + ■  o'WBW. 

where 

Cov[X;(u),X;;(t;)]  =  ^Q'{u)Q'{v){u  Av-  uv). 

Since  the  covariance  function  can  be  written  as 

K{s,t)  =  ^Q'is  A  t){s  A  t)  •  Q'{s  V  t)(l  —  5  V  t), 

the  corresponding  RKHS  inner  product  is  given  by  equation  (A.3)  as 

as  long  as  limx“^  Z{^)^  Q'{^)~^  =  0  2md  hm(l  —  a:)"^  ^{x)^  =  0- 

In  the  parametric  setting,  Q  =  Qt  where  r  €  0  C  for  some  feasible 
parameter  set  ©.  To  set  up  the  AR  estimation  procedure,  fix  7  €  ©  and  let 
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X(t)  =  Xi(t)  =  Qn(i)  -  Qy(t).  Let  Xo(t)  be  a  zero-mean  Gaussian  process 
with  covariance  function  Ky(s,t)  =  n~^  Qy(s)  Qy{t)  '{s  At  —  st).  We  model 
X{t)  as  a  Gaussian  process  with  mean  Qg  —  and  covariance  K^. 

AR  estimation  of  9  is  accomplished  by  minimizing  the  functional 


Multiplying  this  out  and  discarding  terms  constant  with  respect  to  9,  we 
obtain 


which  is  equivalent  for  the  purposes  of  optimization.  To  avoid  distributional 
derivatives,  we  can  integrate  by  parts  to  get 

fWQA'(Qn\^  fWQlY  On 

Jo  \Q'J  [qW  [qW  Q;Io  Jo  [QW  Q'y 

With  appropriate  behavior  at  the  endpoints,  we  have 


2.4.1  Type  II  Censoring 

Another  censoring  mechanism  is  called  Type  II  censoring.  Recall  (section 
2.3.1)  in  the  case  of  Type  I  censoring  that  the  observed  data  are  constrained 
to  a  certain  known  range  and  that  the  number  of  observed  values  is  random. 
The  situation  is  reversed  in  Type  II  censoring:  a  known  number  of  data  are 
excised  from  each  end  of  the  range,  and  the  observed  data  range  is  random. 
Type  I  censoring  is  well  suited  to  AR  estimation  based  on  the  empirical  c.d.f. 
Fn-  In  a  similar  fashion,  AR  estimation  based  on  the  empirical  quantile 
function  can  be  adapted  for  the  Type,  II  censoring  model.  Parzen  [54] 
points  this  out  for  the  special  case  of  location  and  scale  estimation. 

Of  course,  if  we  know  the  underlying  sample  size  n  and  the  numbers  of 
observations  ria  and  Ub  missing  from  each  end,  then  we  also  know  the  points 
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p  and  q  which  delimit  the  domain  of  the  corresponding  quantile  functions.  In 
fact,  p  —  Tia/n  and  q  =  l  -  Uh/n.  For  Type  II  censoring,  the  AR  estimation 
functional  based  on  Qn  is  derived  from  equation  (A.3).  The  result  is 

r  (9il\  Q0(p)Qn{p) _ ]_  Qejq)  Qnjq) 

Jp\Q'J\Q'J  p‘  Q'M  1-9'  W 


2.4.2  Example:  Location  and  Scale  Estimation 

Here  we  consider  AR  estimators  based  on  the  empirical  quantile  function  in 
the  case  of  location  and  scale  families  of  distributions.  For  a  fixed  density 
fo,  we  have  the  two-parameter  family  of  distributions 

where  0  =  {p,  a).  In  this  case,  we  show  that  6i  =  0i  for  all  i  >  1. 

Other  results  (see  sections  2.3  and  2.5)  seem  to  imply  a  direct  relationship 
between  AR  estimators  and  ML  estimators.  However,  based  on  the  work  of 
Bennett  [4],  Parzen  [54],  David  [12],  and  others,  it  is  known  that  0i  need  not 
be  the  MLE. 

To  express  the  location  and  scale  problem  in  terms  of  the  quantile  func¬ 
tion,  we  specify  a  fixed  quantile  function  Qo  and  a  two-parameter  family  of 
candidate  quantile  functions 

{a  -I-  6Q0  :  o  e  R,  6  >  0}. 

For  fixed  7  =  (oo,  bo),  we  have  Q^{u)  =  Uo  +  boQoiu),  and  with  6  =  (a,  b)  we 
have  Q${u)  =  a  +  bQo{u).  The  AR  estimator  6  is  the  minimizer  of 
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or,  equivalently,  of 


Thus,  the  AR  estimator  is  independent  of  choice  of  (ao,6o). 
Let  (x,y)  =  /o^(®/Qo)'(y/Qo)'»  and  define  the  matrices 

(1,1)  Mo) 

(i>  Qo)  (Qo)  Qo) 

Then  we  have 


C  = 


(l,Qn) 

(Qo)  Qn) 


and  R  = 


J{9)  —  —a  (1,  Qn)  —  b  {Qo,  Qn)  +  I  1)  +  (1,  Qo)  +  b^  {Qoi  Qo)) 

=  -9'^C+i9'^R  9, 

so  the  AR  estimator  9,  which  is  the  minimizer  of  J{9),is  given  by 

9  =  R~^  a 


Explicitly,  in  terms  of  the  components,  the  estimator  is 


a 

■(1,1) 

-1 

'<1.0.)' 

b 

(Qoj  Qo) 

(Qoj  Qn) 

To  modify  the  location  and  scale  problem  for  Type  II  censoring  (discussed 
in  section  2.4.1),  the  appropriate  inner  product  is 


{x,y) 


r  ( — ^ + 1 .  ^(p)y(p)  .  1  .  g(g)y(g) 

Jp  \Qo)  \Qo)  p  Q'oip)^  i-g  ' 


The  resulting  estimator  is  the  one  obtained  by  Parzen  [54]. 


2.5  Application  to  Poisson  Process  Intensity 
Estimation 

In  this  section,  we  develop  the  AR  estimator  for  the  intensity  parameter  of 
a  completely  observed  Poisson  process  with  finite  mean  measure.  Again,  the 
prime  denotes  differentiation  with  respect  to  t  6  I. 
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The  RKHS  inner  product  for  a  Poisson  process  (Wiener  process)  covari¬ 
ance  structure,  given  by  equation  (A.l),  is 


as  long  as  limG(x)^iiL’(a:)~^  =  0.  Here,  the  map  ^  is 

MG)  =  l'§;dN, 

and  the  AR  density  functional  for  a  Poisson  counting  process  with  mean  G 
and  intensity  g  is  then 

We  now  characterize  the  limiting  optimization  problem  associated  with 
this  AR  estimator  sequence  in  the  regular  case.  The  following  result  is  the 
Poisson  process  analogue  of  Theorem  2.5. 

Theorem  2.6.  For  arbitrary  fixed  n,  let  the  sequence  of  AR  esti¬ 

mators  for  the  parameter  t  of  a  Poisson  process  mean  measure.  Suppose  the 
ML  and  AR  estimation  problems  are  regular.  If  the  AR  estimator  sequence 
converges,  so  that 

9i  9  as  i  oo, 

then  9  is  the  also  maximum-likelihood  estimator  of  r . 

In  other  words,  if  the  AR  estimator  sequence  converges  (for  any  fixed 
value  of  n),  then  it  converges  to  the  maximum-likelihood  estimator. 
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2.5.1  Example:  Exponential  Intensity 


Consider  a  Poisson  process  on  [0,r]  with  intensity  function  \{t)  =  ae 
The  density  functional  is 

J(a,  b)  =  -  r  -^^  dN{t)  +  l  r  dt 

^  '  Jo  ooe-^°*  ^  ^  2  Jo  aoe-^o‘ 


=  e-ib-bo)t, 


20o  JO 


-  fi  _  e-n26-6o)l 

-  ^2ao{2h-bo)^  J’ 


as  long  as  26  —  6o  >0.  To  accomplish  the  optimization,  we  differentiate  with 
respect  to  o, 

^  =  _  Jl  ^  p-(6-6o)t<  j.  g  [1  - 


dCL  CLq 

and  set  dJ/da  =  0  to  get 


ao(26  —  bo)  ’ 


26 -6o  ^  -(b-bo)u 


yo 

^  “  1  _  Q-T{2b-bo)  2L/® 


Substituting,  we  have 

2b -bo 

oo  [1  -  e-n26-6o)] 


”  1^  r  1 

e-(>-6o)ti  —1  j _ 1 

[h  J  L  '+2[l-e- 


_  e-r(2t-6o)] 


With  T  oo,  this  becomes 


_  26 -6o  -(6-6o)t.- 


So  the  AR  estimator  6  is  the  minimizer  of 


J(b)  =  {bo-2b)  , 
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and  we  recover  a  from  equation  (2.7). 


a=(26-6o)S®'^*” 


ho)ti 


i=l 


On  the  other  hand,  the  likelihood  functional  is 

L=  f^logXdN-  r X 
Jo  Jo 

n  pT 

=  y^(log  a  —  bti)  —  O'  I  dt 

=  n  log  a  —  6  ^  ^  (e“*^  —  l) . 

»=i 

So,  for  large  T,  we  have 


L  =  nloga 


The  derivative  with  respect  to  a  is 


dL  _n  1 
da  a  b' 

so  that  dL/da  =  0  if  and  only  if  a  =  nb.  Substituting,  we  obtain 


n 

L  =  n  log  n6  —  fe  ^  tj  —  n. 

»=i 


Now, 


dL  _n_^ 
db~b 

t=l 


and  dL/db  =  0  if  and  only  if  b~^  =  n~^  U  =  t.  Thus,  we  finally  have  the 
ML  estimators 

j-i  A  u-  ^  -  j-i 

5  —  ±  End  0  ^  ^ 

2^1=1  w 
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2.6  Technical  Details 

2.6.1  Distributions  of  Functionals 

The  following  facts  enable  us  to  calculate  the  asymptotic  distributions  of 
functionals  that  arise  in  the  course  of  the  large-sample  analysis  of  the  AR 
estimator.  Here,  we  make  the  dependence  of  ^  upon  the  process  explicit,  as 
in  section  1.4. 

Lemma  2.7  is  a  standard  result  in  Hilbert-space  time-series  methods. 
Lemma  2.8  is  nonstandard,  but  required  in  our  applications  to  characterize 
the  behavior  of  the  map  (j)  when  the  reference  covariance  structure  is  different 
from  the  true  covariance  of  the  process  X„. 

Lenuna  2.7.  Let  X  he  a  stochastic  process  with  sample  paths  in  8,  mean- 
value  function  M,  and  covariance  function  K.  Let  Hk  he  the  RKHS  with 
reproducing  kernel  K.  Assume  that  M  6  Hk-  Let  <f>  :  %>(.  Hk  R  he  a 
bilinear  functional  vnth  <j){Y,Kt)  =  Y{t)  for  allt  e  I  and  Y  e  B.  Then,  for 
allfeHK, 

E4.{X,f)  =  (M,f)  and  Var  =  ||/f , 

Lemma  2.8.  Let  X  he  a  stochastic  process  with  sample  paths  in  S,  mean- 
value  function  Mg,  and  covariance  function  Kg  under  prohahility  law  Lg  for 
z  €  {o,6}.  Let  Hg  he  the  RKHS  with  reproducing  kernel  Kg.  Assume  that 
Mg  €  Hg.  Let  each  bilinear  functional  (f>g  :  $  x  Hg  R.  satisfy  4>z{Y,  Kgt)  = 
Y(t)  for  all  t  E  I  and  Y  E  B.  Suppose  that  H^  =  Hi,  =  H,  that  the  linear 
operator  T  =  T^a  on  H  given  by  T{Ki,t)  =  Hat  is  bounded,  and  that  the  linear 
functional  E  <f>b{X,  •)  on  H  is  bounded.  Let  S  =  Sba  be  a  square  root  of  T. 
Then  under  law  Ha,  for  all  f  E  H, 

EMHJ)  =  {Ma,f)i,  and  Ys^MX,f)  =  \\Si,af\\l 

Up  to  this  point,  we  have  not  been  specific  about  the  definition  of  con¬ 
vergence  in  distribution  for  random  functions.  There  are  several  different 
general  approaches.  See  Billingsley  [5],  Shorack  and  Wellner  [63],  and  the 


39 


references  contained  in  both  for  discussions  of  the  convergence  of  probabil¬ 
ity  measures  on  metric  spaces.  According  to  Pollard  [56],  for  example,  the 
typical  metric  space  setup  is  as  follows. 

There  are  an  implicit  underlying  probability  space  (f2,3^,Pr), 
and  a  function  space  S  with  cr-algebra  A  and  metric  S.  Always, 

A  C  ®j(8),  the  Borel  <T-algebra  generated  by  the  (J-open  sets  of 
S.  A  usual  choice  is  A  =  ®?(S),  the  o--algebra  generated  by  the 
5-open  balls  in  S.  The  random  elements  Z  :  f2  ->•  S  are  then  the 
T/A-measurable  functions.  Equip  the  reals  with  the  usual  Borel 
cr-algebra  !B.  Let  /  :  S  R  be  bounded,  5-continuous,  and  A/S- 
measurable.  Then,  Z„  Z  in  S  means  that  /(Z„)  f{Z) 
for  each  such  /. 

For  a  different  (metric-free)  approach,  see  Kallenberg  [30],  Karr  [31],  and 
their  references.  In  any  case,  however,  the  useful  and  practical  application 
of  convergence  in  distribution  can  always  be  described  as: 

Given  a  sequence  Z„  of  random  elements  taking  values  in 
a  space  8  and  a  functional  /  :  8  ->  R,  one  seeks  to  establish 
conditions  that  imply  that  the  image  of  /  converges  in  (one¬ 
dimensional)  distribution  in  R;  i.e., 

f{Zn)  f(Z)  as  n  ->  oo.  (2.8) 

In  general,  this  can  be  accomplished  by  identifying  an  appropriate 
mode  of  convergence  in  distribution  in  8;  i.e., 

Zn  Z  as  n  ^  oo,  (2.9) 

and  then  showing  that  /  belongs  to  a  class  of  functions  for  which 
(2.9)  implies  (2.8). 

We  are  not  concerned  with  the  theoretical  details  but  need  only  consider  the 
practical  use  of  the  technology,  since  our  interest  is  in  the  application  of  the 
AR  concept  to  specific  stochastic  processes 
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Therefore,  throughout  this  work  we  assume  that  the  convergence  in  dis¬ 
tribution  of  the  stochastic  processes  specified  in  statements  (1.2)  and  (2.1); 
i.e., 

>*  Ar  as  Ti  — y  cx),  (2.10) 

implies  convergence  in  distribution  of  the  functionals 

^TiAn,T,G)  ^r{Ar,G)  as  n ->•  00  (2.11) 

for  any  G  €  Hr. 

While  the  manner  of  convergence  in  (2.10)  for  each  of  the  processes  we 
consider  may  be  different  (e.g.,  in  the  metric  space  setting,  the  <r-algebra  A 
and  metric  5  are  specific  to  the  application),  statement  (2.11)  holds  nonethe¬ 
less. 

Now  we  consider  the  distribution  of  the  limit  <f>r(Ar,G)  in  (2.11).  It  is 
easy  to  show  that  this  quantity  has  a  normal  distribution.!  This  follows  from 
the  fact  that  the  set  UiKt,  :  n  e  N,  U  ^  I,  a  e  R}  is  dense  in  H, 

upon  consideration  of  a  Cauchy  sequence. 

The  large-sample  properties  of  AR  estimators  developed  in  this  chapter 
are  determined  by  the  asymptotic  behavior  of  (^.y(An,T>  G)  for  various  values  of 
G.  In  particular,  suppose  that  Mg,  Mg,  And  Mg  lie  in  for  all  0  6  0,  where 
the  dot  denotes  differentiation  with  respect  to  9.  In  light  of  the  previous 
development,  the  asymptotic  distributions  of  functionals  of  interest  are 

4>^iA„,r,Mg)  -U  M^^Mg)  ~  iV(0,  \\SyrMgf^), 

^'>l{An,T‘>Mg)  <f>^{Ar,Mg)  ~  iV(0,  ||iS'.yr.^(9||^)>  £Uld 

<t>Mn,T,Mg)  4>^{Ar,Mg)  ~  N{Q,  \\S.,rMgf^). 

The  AR  objective  functional  and  its  ^-derivatives  are 
!„.,(«)  =  \\\Me  -  M,||?  -  M,), 

Ln,'r{&)  =  i^Mg  —  Mr,  ^e)y  and 

Km  =  (Mg  -  Mr,Mg)^  -h  \\Mg\\^^  -  ^<f>,{A.,r,  Me). 
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The  asymptotic  distributions  of  these  quantities  follow  from 

yfn*  ^Ln^y{d)  —  M9)? 

\/n  •  —  (^M$  —  Mr^M^  =  ~07(-4n,r) 

y/^'  hM-{Me-MrA)^-\\M0%  = 

In  particular,  with  6  =  r,  these  become 

y/nLfi^y^T)  =  — 0-y(.An,T> -^t)) 

y/nLn,^{T)  =  -<f>y{An,T,  Mr),  and 

[in,,(T)  -  IIMrII?]  = 

More  details  about  the  behavior  of  functionals  of  the  observed  process  and  the 
AR  estimators  are  developed  as  needed  in  the  proofs  of  the  lemmas,  theorems, 
and  corollaries  of  chapter  2.  These  proofs  are  presented  collectively  in  the 
next  section, 

2.6.2  Proofs 

Proof  of  Theorem  2.1.  Since  Var  Ln^fiB)  0  as  n  ^  cx),  Chebychev’s  in¬ 
equality  gives  convergence  in  probability  of  Ln,>y{B)  to  its  expectation  5  HMu  — 
Mrll^.  In  particular,  Ln,y{T)  0  as  n  — )•  00.  Now  fix  5  >  0  and  let 

e  =  inf{EL„,,,(0):|0-T|^(5}. 

Pick  any  o  <  e/2.  Define  the  events  Aq  =  {|r„  -  r|  <5}  and  Ai  = 
{LuM  <  a}.  For  large  enough  n,  we  have 

Pr  (AiAl)  =  Pr  (L„,7(t„)  <  a  and  |t„  -  t|  >  (5) 

<  Pr  (sup{|L„.,,(^)  -  E :\e-T\^5}>a) 

^  Pr  (sup{|L„,.y(^)  -  EL„,.y(0)| :  0  €  0}  >  a) 

<  a 
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and 


Pr(i4i)  =  Pr(L„,^(r„)  <  a)  ^  Pr(I„,-y(r)  <  a)  ^  1  -  a. 
Partitioning  on  Aq,  we  have 

1  -  a  ^  Pr(>li)  =  Pr(>liAo)  +  Pr(i4ii4o)  <  Pr(i4o)  +  a. 

Putting  this  together,  we  have  Pr(i4o)  >  1  —  2a,  as  required.  □ 


Proof  of  Theorem  2.2.  For  small  enough  S,  the  ball  {9  '.\9  —  t\<  5}  lies 
in  the  interior  of  0.  So  any  minimizer  r„  of  with  |r„  -  r]  <  iJ  also  has 
i/(r„)  =  0.  The  conclusion  follows  by  Theorem  2.1.  □ 


Proof  of  Theorem  2.3.  Sen  and  Singer  [61]  use  the  following  approach 
to  prove  a  theorem  about  the  properties  of  maximum-likelihood  estimators. 
They  attribute  the  technique  to  Le  Cam  [42],  Hajek  [25],  and  Inagaki  [28]. 

Fix  an  arbitrary  K  €  (0,oo),  and  for  |«|  ^  K  consider  the  Taylor’s  series 
expansion  of  about  r. 

where 

A„(m)  =  y/nuLn,j{T)  +  and  r„*(M)  G  (t,  r  -H 

so  that  |r„*  —  r|  <  ^  rT^/^K.  Let 

Zn{u)  =  Ln,7(T„,)  -  L„,.y(r)  =  B„(u)  +  Cn{u)  - 

where  ' 

^ni,^  —  i^Tn*  ~ 

a(^)  =  lK.II?-||Mr||?,  and 
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Then  we  have 


^n(^)  —  2^  •^n,7(‘^)  "I"  2^  ^n(^)  '• 

For  any  e  >  0,  fix  an  no  >  K^/e^  and  note  that  if  n  >  no,  then  n”^/^  K  <£ 
and  |t„*  —  r|  <  e.  Therefore,  for  all  suflBciently  large  n,  we  have 

sup{|J5„(u)| :  |«|  <  FT}  ^  Be  =  sup  |  (m,?  -  Mr,  Me)  :  |^  -  t|  ^  e 
ti  ell'  '  T'l 

^  sup  |||Me  -  Mr  11^  •  ||M(9||7  :  1^  -  r|  <  e| . 

Since  the  continuity  assumptions  imply  that  Be  — 0  as  e  ->  0,  we  have 
|B„(n)|  0  as  n  — 00  uniformly  in  |«|  ^  K.  Likewise,  for  all  large  enough 

n, 

supilCnCu)! :  |n|  ^K}^Ce=  sup  {  IIMtfll^  -  ||Mr||?  :  -  t|  <  e| , 

1  ■ 

and  Ce  — >  0  as  e  — )■  0.  Therefore,  |C„(n)|  0  as  n  -4  oo  uniformly  in 

|n|  <  K.  Again,  for  all  sufficiently  large  n, 

sup{|I>„(n)| :  |«|  sup{|^^(A„,r,M(?  -  Mr)| :  -  t|  <  e}  =  5e(A„,T)- 

«  B 

Since  pe(A»,r)  =  Op(n^/^),  we  have  sup„{|I)„(u)|  :  |n|  <  iif}  =  Op(n^/^),  and 
therefore  sup„{n“^/^|Bn(M)|  :  |n|  ^  K}  =  Op(l)  as  n  — >  oo.  Since  we  also 
have  ||Mr||^,  we  may  conclude  that  uniformly  on  |n|  <  K 

A„(«)  =  y/nuLn,^{T)  +  in^llMrll^  +  Op(l)  as  n  ->•  oo. 

With  Wo  =  —y/nLn>^{T)/\[Mr'^^  and  c  =  ||Mr|l7/2,  we  can  write 

A„(u)  =  c{u  -  uof  -cul  +  gn{u), 

where  sup{|p„(w)|  :  |n|  ^  K}  -4  0  as  n  ->•  oo. 

Define  the  events  A  =  |uo|  <  K  —  a  and  B  =  |u„  —  no|  <  a.  For 
any  small  a  >  0,  choose  K  and  N  large  enough  so  that  Pr(A)  >  1  -  a 
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and  Pr  (sup{|p„(m)|  :  |?/|  <  K]  <  jca^)  >  1  —  a  if  n  >  N.  Let  \n{u„)  = 
inf{A„(u) :  1«|  ^  K}.  Conditional  on  A,  we  have 

B  An(tto)  <  inf{A„(w) :  -  no|  ^  a} 

-cul  +  sup  |Pn(M)|  <  ca^  -  cul  -  sup  |p„(m)| 

N<Jir  |u|<ji- 

sup  |p„(u)|  <  \co?] 

|u|<ji: 

and  since  Pr(B)  ^  Pr(i4nB)  =  Pr(^)  Pr(jB|^)  >  (1-a)^,  we  have  Pr(B)  -> 
1  as  O'  ->  0.  Therefore  —  «o  0  as  n  ->  oo.  But  r„  =  r  +  so 

Since 

V^Z„,y(r)  =  -<i>^{An,r,Mr)  A  -<l>^{Ar,Mr)  -  N{Q,  ||5^rMr||?) 
as  n  00,  the  result  is  established.  □ 

Proof  of  Theorem  2.4.  By  Theorem  2.3,  we  have 

Ari-iv^o, 

so  that,  in  particular,  r„,i  r  as  n  -f  oo.  Thus,  for  arbitrarily  small 
/3  >  0,  there  is  an  iV  such  that  if  n  >  iV  then  Pr(T„,i  €  N{t))  >  1  -  /?. 
Now  let  i  =  2.  Fix  an  arbitrary  K  e  (0,  oo).  Proceeding  as  in  the  proof  of 
Theorem  2.3,  we  can  write  the  Taylor’s  series  expansion  of  about  r 

for  |u|  <  K. 

Ln,T„,i{r  +  :^«)  =  -t'n,T„,i(T)  +  jA„(u), 

where 

An(w)  =  V^^^-^n,T„,i  ('t)  +  \‘^‘Zn,Tn,i^"^n*) 

■\/nuLn,Tn,ii7')  d"  2^  d"  2^  "t"  ^n(w)  +  j 
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in  which 

Bn{u)  =  \  , 

'  '  ^n,l 

C„W  =  ||M,„K„.-||M,||?.,„  and 

Dfti^  =  J  (j4n,TJ -^Tn.  ~  -^r)} 

with  Tn*(w)  =  r„*  €  (T,r  +  so  that  |t„*  —  r|  <  ^ 

For  sufficiently  large  n,  we  have 

sup{|5„(u)| :  |u|  ^  iST}  <  sup  |  (Mb  -  Mr,  Mg^  :  |^  -  t|  ^  e| 

^  sup  jllMfl  -  Mrllrn.i  •  ||Me||r„,i  :  -  t|  ^  e}  , 

and  therefore  with 

Be  =  sup  sup  I \\Me  -  MrW^  ■  \\Mg\\rf :  -  t|  <  e| 

76iV(T)  0  ^  * 


we  have 


Pr 


sup{|B„(u)|  :  |u|  ^k}  4,  Be 


>1-/3. 


The  continuity  assumptions  imply  that  Be  — >  0  as  e  — 0,  so  we  have 
|B„(u)|  0  as  n  ->  00  uniformly  in  |u|  <  K.  Likewise,  for  all  large 

enough  n, 

supdCnCu)! :  |u|  sup  llllMell^  -  |lMrllr„,i  :  1^  -  r|  <  e}  . 


So  with 


we  have 


Ce  =  sup  sup  I  ||Mo||5  -  \\Mr\\y  :  -  r|  <  e| , 

7€JV(t)  0  *•  ^ 


Pr 


sup{|C'„(m)|  :  |«|  ^K}^Ce 


>1-/9. 


The  continuity  assumptions  imply  that  Cg  — 0  as  e  — >  0,  so  we  have 
|C'„(u)|  0  as  n  — oo  uniformly  in  |u|  <  K.  Again,  for  all  sufficiently 

large  n, 


sup{|£>„(u)l  :  |u|  sup{|0r„,i(A»,T,-^?  -  Mr)| :  |0  -  t|  <  e}, 

u  0 
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so 


Pr  |^SUp{|I>„(m)|  :  |li|  ^  K}  <  pe(^n,r)J  >1-/3. 

Since  Pe(i4„,r)  =  Op(n^/^),  we  have  sup„{|D„(w)|  :  \u\  ^  K}  =  Op(n^/^),  and 
therefore  sup„{n“^/^|J?n(«)| :  |«|  <  K}  =  Op(l)  as  n  ->  oo.  The  7-continuity 
of  <f)y  implies  that 

^Tn,i  —  0r(-^n,T)  F)  y  0, 

and  it  is  basic  that 

for  F  =  Mr,  Mti  or  Therefore, 

•^n.Tn.lC'r)  —  ll-^TllTn,!  —  ~^^Tn,l{-^n,TlMr)  y  0, 
and  we  may  conclude  that  uniformly  on  |u|  ^  K, 

An(n)  =  v/nMin,r„.i(r)-h|«2||JWr||r„,i+o  as  n^oo. 

Since  Ln^^{T)  =  —n~^^^<j>y{An,r,Mr),  and  (•, ')^  is  7-continuous,  we  in  fact 
have 

A„(u)  =  y/nuLn^riT)  +  |u^||Mr||T  +  Op(l)  aS  n  -)■  OO 
uniformly  for  \u\  <  K.  As  in  the  proof  of  the  previous  theorem,  we  obtain 


fZ(-  _N  I  V^Fn,T{T)  P ,  n _ _ 

Vn{Tn,2  -T)-\ — - y  0  as  n 


-y  00. 


Since 


VnL„.,(r)  =  -<t>r{An,r,Mr)  -A  -<l>r{Ar,Mr)  ~  N{Q,  1|M,||2), 
we  have  established  that 

VS(r„,,-r)^y,~Ar(o,^)  asn 

By  induction,  we  may  conclude  that 

v/S(r„,-r)-^r,  ~^r(o,^)  as 

for  all  i  >  1. 


00. 


00 


□ 
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Proof  of  Theorem  2.5.  If  the  AR  estimator  sequence  converges  to  some 
value  9,  then  0  is  a  fixed  point  of  the  operator  iS„  defined  on  page  20.  So 
Sn{9)  —  9,  which  means  that 


(=e 


Differentiating,  we  have 


Substituting  ^  =  9,  we  get 

=  -| /log/,<iF„. 

Therefore,  9  solves 

maximize  J  logf^dFn, 

which  characterizes  the  maximum-likelihood  estimator  of  a  random  sample 
density  parameter.  □ 


Proof  of  Theorem  2.6.  If  the  AR  estimator  sequence  converges  to  some 
value  9,  then  0  is  a  fixed  point  of  the  operator  defined  on  page  20.  So 
Sn{9)  =  9,  which  means  that 


(=6 
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Differentiating,  we  have 


Substituting  ^  =  0,  we  get 


-I 


9e 


I 

1'  9id^9i 

9e  J 

9e 

J? 

II 

i«=.  '‘ff+J 

a  . 

=  |[-/log9f<iJV  +  G«(l)] 


Therefore,  6  solves 


maximize 

i 


J  \0%g(dN-G((l), 


which  characterizes  the  maximum-likelihood  estimator  of  a  Poisson  process 
intensity  parameter.  □ 

Proof  of  Lemma  2.7.  See  Parzen  [51].  □ 


Proof  of  Lemma  2.8.  We  suppress  the  dependence  of  ^  on  S  and  write 

<!,{})  (ax  ■KX,!). 

Since  E  is  a  continuous  linear  functional  on  H,  there  is  a  unique  s  €  JST 
with  E^b{f)  =  (s,/)j  for  all  /  €  if.  In  particular,  E(f>(,{Kbt)  =  {s,Ki,t)b  = 
s{t).  On  the  other  hand,  we  have  EcpbiKu)  =  EA’(t)  =  Ma{t).  Therefore, 
s{t)  =  Ma{t),  and  E  (f>bif)  =  {Ma,  f)b  as  required. 

Consider  the  bounded  linear  operator  T  =  Tba  on  H  given  by  (Tf)(x)  = 
{f,Kax)b(  and  the  associated  bounded  bilinear  functional  B{f,g)  =  {Tf,g)f^. 
Note  that 

iTKbt){x)  =  {Kbu  Ka,)b  =  KaS)  =  Kat{x). 
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B{Ki„Ku)  =  {TKt>,Kbt\  =  {Kas,Kf^),  =  K^sit)  =  Ka{s,t). 


Also, 


But  since 

K^is,t)  =  Coy[X{s),X{t)]  =  Coy[<f>i,{Kts),MKbt)], 

we  have  in  general  B{f,g)  =  Covy>b{f)^<l>b{9)]  and  B{f,f)  =  Wax  (f>b{f). 
Since  B  is  positive  definite,  symmetric,  and  self-adjoint,  we  know  that  T  has 
a  positive  definite,  symmetric,  self-adjoint  square  root  S  =  Sba-  So  we  can 
write  (T/,  f)^  =  {Sf,  Sf),,  =  ||5/||^,  whence  in  fact 

Var  Uf)  =  II  Vll'  =  «/>^«(-)>6  «/(•)>, 

for  all  feH.  □ 
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3.  Nonpareunetric  AR  Estimation 


In  this  chapter,  we  consider  asymptotic  regression  estimation  in  the  setting 
of  an  infinite-dimensional  real  parameter  space.  As  one  would  expect  in  this 
case,  even  for  a  fixed  initial  guess  7,  the  minimization  problem  either  has 
no  solution  or  results  in  an  estimator  that  is  not  smooth  enough.  In  other 
words,  the  problem  is  ill-posed.  However,  we  show  that  the  technique  of 
regularization,  or  penalization,  can  be  applied  to  these  problems  to  produce 
estimators  that  have  desirable  asymptotic  properties. 

We  address  nonparametric  estimation  in  the  context  of  specific  applica¬ 
tions.  Basic  density  estimation  is  the  subject  of  section  3.1.  Density  estima¬ 
tion  for  inverse  problems  is  considered  in  section  3.2.  Section  3.3  contains 
a  short  discussion  of  the  application  of  ARE  to  Poisson  process  intensity 
estimation.  Proofs  are  deferred  to  section  3.4. 

3.1  Density  Estimation 

Let  Xi, . . .  ,  Xn  be  i.i.d.  random  variables  on  7  C  E  with  c.d.f.  Fo  and  p.d.f. 
Fo  =  fo‘  Let  h  be  a  p.d.f.  on  7.  In  what  follows,  f  means  fj.  With  the 
objective  functional 

and  constraint  setC={/:/^0,  //  =  l},  the  natural  nonparametric  ver¬ 
sion  of  the  AR  density  estimation  problem  is 

minimize  Jn,h{f)  subject  to  /  G  L2  H  6.  (3.1) 

However,  this  problem  is  ill-posed  in  the  following  sense. 

Theorem  3.1.  Problem  (3.1)  has  no  solution. 

In  the  proof,  we  construct  a  sequence  in  L2nC  with  the  property 

that  J*{gm)  —00  as  m  — »•  00,  thereby  showing  that  J*  is  unbounded. 
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In  fact,  Gm{t)  =  gmi^)  dx  converges  in  L2  to  F„,  the  empirical  c.d.f. 
Since  the  optimization  problem  (3.1)  has  no  solution  in  L2,  it  is  useless  for 
estimating  /„  and  its  derivatives. 

The  same  thing  happens  when  one  attempts  to  extend  maximum  likeli¬ 
hood  density  estimation  by  relaxing  the  parametric  restriction.  The  problem 

n 

maximize  f{Xi)  subject  to  /  €  L2  n  6  (3.2) 

^  »=i 

has  no  solution.  The  proof  of  this  fact  utilizes  the  same  construction  as 
Theorem  3.1.  See  Thompson  and  Tapia  [73]. 

One  approach  that  has  been  exploited  is  to  replace  problem  (3.2)  by  an 
approximate  version  that  hzis  a  useful  solution — ^namely, 

n 

maximize  JJ  f{Xi) exp[— nai/(/)]  subject  to  f  £  L2  DC.  (3.3) 

^  i=i 

Note  that  equivalent  formulations  of  problems  (3.2)  and  (3.3)  are 
minmize  —  J  log/(x)djP„(x)  subject  to  /G  2^2  0  6 

and 

minimize  -  J  logf{x)  dFn{x)  +  au{f)  subject  to  /  €  L2  H  6, 
respectively. 

The  functional  u,  known  as  a  penalty  functional,  is  chosen  to  give  larger 
values  when  /  is  “less  smooth.”  Solutions  to  (3.3)  are  called  maximum  pe¬ 
nalized  likelihood  density  estimators.  The  parameter  a  is  a  positive  real 
number.  Typically,  when  a  — 0  at  some  rate  as  n  — 00,  one  obtains  a  se¬ 
quence  of  estimators  with  good  asymptotic  properties.  The  literature  is  rich 
with  references  to  work  in  related  areas,  such  as  regularization  in  Tikhonov 
and  Arsenin  [74];  maximum  penalized  likelihood  density  estimation  in  Good 
and  Gaskins  [21],  de  Montricher,  Tapia,  and  Thompson  [13],  Silverman  [66], 
and  Thompson  and  Tapia  [73];  smoothing  splines  in  Wahba  [76],  [79],  and 
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Gu  [23];  nonparametric  estimation  and  regression  in  Stone  [70],  Klonias  [34], 
Eubank  [16],  Gu  [24],  and  Hardle  [27];  and  inverse  problems  in  O’Sullivan  [50] 
and  Cox  [9]  . 

We  can  apply  the  method  of  penalization  to  problem  (3.1)  and  obtain 
a  related  problem  that  does  have  a  well-behaved  unique  solution.  We  now 
formulate  the  penalized  problem. 

Let  Xi, . . .  ,X„  be  i.i.d.  random  variables  defined  on  a  bounded  domain 
/  C  R,  with  c.d.f.  Fo  and  p.d.f.  /<,  =  The  function  h  is  a  smooth  p.d.f.  on 
I.  Denote  by  D  a  linear  differential  operator  of  order  p  ^  1,  with  no  constant 
term,  defined  on  a  suitable  domain  with  appropriate  boundary  conditions. 
We  refer  to  the  positive  real  constant  a  as  the  smoothing  parameter.  Using 


the  penalty  functional 

we  define  the  penalized  AR  density  estimation  functional  to  be 

J.M)  =  +  +  <=•’'(!)■  (5-5) 

The  corresponding  single-step  AR  density  estimation  problem  is  then 

minimize  Jn,h{f)  subject  to  /  €  IKp  H  6,  (3.6) 

and  its  solution  f  =  fn  satisfies 

J„.fc(/)  =  inf  {Jn,h{f) :  /  e  n  e} .  (3.7) 

For  fixed  n  and  /„,o,  the  recursive  sequence  of  AR  estimators  {fn,o,  fn,i, 
/n,2»  •  •  • )  is  characterized  by 

=  (3.8) 


In  order  to  establish  the  existence  and  uniqueness  of  the  solution  of  prob¬ 
lem  (3.6),  we  exploit  the  inner-product  structure  implicit  in  the  form  of  the 
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penalized  objective  functional  (3.5).  Specifically,  the  deterministic  part  of 
Jn,h{f)  is  a  quadratic  form  that  arises  from  a  Sobolev  space  norm.  Note  also 
that  these  are  the  penalty  functionals  that  generate  smoothing  splines. 

In  our  setting,  the  Sobolev  spaces  IKp  are  Hilbert  spaces  of  functions  that 
have  (Lebesgue)  square-integrable  derivative. 

:Kp  =  {f:f^^eL2nD},  (3.9) 

where  the  domain  D  formally  incorporates  both  the  function  support  and 
also  the  boundary  conditions  appropriate  to  the  application.  Since  domains 
and  boundary  conditions  are  application-specific,  we  generally  suppress  their 
explicit  formulation  and  simply  remember  that  they  are  a  required  element 
of  the  problem  statement. 

Some  basic  references  for  Sobolev  spaces  and  optimization  are  Adams  [1], 
Atteia  [3],  Kufner  [35],  and  Luenberger  [46].  However,  we  only  need  a  few 
facts  about  Sobolev  spaces  and  a  theorem  of  Thompson  and  Tapia  in  order 
to  give  the  existence  euid  uniqueness  statement.  The  standard  Sobolev  inner 
product  is  given  by 

»=o  •' 

In  our  applications,  a  natural  norm  for  !Kp  is  given  by 

11/115,1.  =  //=+ /(®/)=- 

Weighted  Sobolev  spaces  incorporate  a  weight  function  into  the  integral  def¬ 
inition  of  the  inner  product,  e.g., 

11/115,. =E/|/''*wr  »(*)<« 

»=o  •' 

or  r  r 

11/115,0...  =  J  fHt)w{t)dt+ j  ([I>/I(t)f  »(«)*. 

Recall  that  two  norms  ||  •  ||i  and  ||  •  ||2  on  a  space  8  are  equivalent  if  there  are 
positive  real  constants  a  and  b  such  that 

a||a:||i  ^  ||a;l|2  ^  &||a;||i,  V®  €  8, 
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in  which  case  the  two  norms  in  question  give  rise  to  the  same  topology,  or 
notion  of  convergence,  on  8. 

For  certain  classes  of  weights  w,  the  corresponding  weighted  spaces  coin¬ 
cide  with  the  unweighted  spaces.  This  occurs,  for  example,  if  there  exist  6 
and  A  with  0  <  J  ^  w{t)  <  A  <  oo  for  all  t,  in  which  case  it  is  also  obviously 
true  that  the  weighted  and  unweighted  norms  are  equivalent.  Unweighted 
spaces  are  of  course  weighted  with  w  =  1.  These  various  norms  are  related 
in  this  context  by  the  following  useful  fact. 

Lemma  3.2.  Let 

«,=  {/:  €  L,{I)}  , 

where  I  CR  is  hounded.  Suppose  there  exist  5  and  A  such  that  w  :  I  -^R 
satisfies  0  <  S  ^  w{t)  ^  A  <  oo  for  all  t  €  I.  Suppose  V  is  a  linear 
differential  operator  of  order  p  ^  1  with  no  constant  term.  Then  the  following 
quadratic  forms  engender  equivalent  norms  on  JCp.* 

11/11^.. =E/i/“’r. 

t=0  •' 

miv  =  f  f^+ f\Df?- 

To  establish  the  existence  and  uniqueness  of  the  solution  to  problem  (3.6), 
we  use  an  optimization  theorem  of  Thompson  and  Tapia  [73]. 

Theorem  3.3.  Let  !K  be  a  Hilbert  space,  let  Q  be  a  closed  convex  subset  of 
TC,  and  let  the  functional  J  :  —^R  be  continuous  in  G  and  twice  Gateaux 

differentiable  in  G  with  second  Gateaux  derivative  uniformly  positive  definite 
in  G.  Then  the  problem 


minimize 

/ 


J{f)  subject  to  f  G  G 


has  a  unique  solution  in  C. 
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Using  Lemma  3.2,  we  easily  verify  that  problem  (3.6)  satisfies  the  condi¬ 
tions  of  Theorem  3.3.  This  implies: 

Corollary  3.4.  Problem  (3.6)  has  a  unique  solution. 

Refer  to  the  discussion  immediately  following  Theorem  3.1  and  note  that 
the  functions  =  G'^  constructed  in  the  proof  of  that  theorem  are  discon¬ 
tinuous.  Hence,  they  do  not  lie  in  if  p  ^  1.  Of  course,  no  sequence  Qm  in 
yCp  has  the  property  that  the  penalized  objective  functional  J(pm)  ~oo  as 
m  oo. 

On  the  other  hand,  if  the  penalization  operator  D  for  problem  (3.6)  is 
of  order  p,  the  problem  has  a  unique  solution  with  at  least  p  —  1  continuous 
derivatives.  Thus,  one  may  choose  the  penalization  order  based  on  which 
function  (derivative  of  Fg)  one  wishes  to  estimate. 

As  in  the  parametric  case,  we  have  a  theorem  characterizing  the  limit  of 
the  estimator  sequence. 

Theorem  3.5.  With  n  fixed,  if  the  estimator  sequence  associated  with  prob¬ 
lem  (3.8)  converges,  its  limit  is  a  maximum  penalized  likelihood  estimator. 

The  corresponding  penalty  functional  has  first  Gateaux  derivative 

This  functional  itself  does  not  seem  to  have  a  closed-form  representation, 
although  it  closely  resembles  the  Good  and  Gaskins  [21]  penalty. 

3.1.1  Representation  of  the  Density  Estimator 

In  order  to  understand  the  properties  of  the  penalized  AR  estimator,  we 
characterize  it  as  the  solution  of  a  certain  differential  equation.  The  solution 
can  be  represented  as  a  generalized  kernel  density  estimator  by  invoking 
the  superposition  principle  for  differential  equations.  Some  details  about 
properties  of  the  eigenvalues  and  eigenfunctions  of  the  differential  equation 
are  also  useful. 


56 


Let  Xi,  ...,Xn  be  i.i.d.  random  variables  with  c.d.f.  Fo  and  p.d.f.  /<,.  Let 
D  be  a  linear  differential  operator  of  order  p  ^  1,  with  no  constant  term, 
defined  on  a  domain  with  appropriate  boundary  conditions.  For  fixed  h  we 
obtain  the  AR  density  estimator  as  the  solution  /  of  the  problem 

minimize  J{f)  =  -  f  +  l  j  J +  ^  J ^  (31») 

subject  to  /  ^  0  and  J  f  =  1. 

The  standaxd  inner  product  on  L2  is  {x,  y)  =  J  xy,  and  the  corresponding 
square  norm  is  |lx|p  =  {x,x).  With  the  weight  function  w{t)  =  l/h{t),  we 
can  identify  the  Hilbert  space  1-2,^,  which  has  inner  product  {x,  y)^  =  /  xyw. 
Note  that  {x,y)^  =  {x,wy).  Let  Sg{t)  denote  the  unit  point  measure  at  s. 
The  empirical  point  measure  is  then  fn  =  =  n~^  so  that  F„(t)  = 

J^fn{z)dz  =  f*F^(z)dz  =  fg  dFn(z).  We  can  now  write  the  objective 
functional  of  (3.10)  as 

To  describe  the  solution  of  problem  (3.10),  we  introduce  adjoint  operators 
in  Hilbert  space.  The  L2  adjoint  D*  of  D  is  characterized  by  (!Dx,y)  = 
(x,  !D*y),  with  x  in  the  domain  of  T>  and  y  in  the  domain  of  D*.  With 
respect  to  L2,wi  the  adjoint  D"*"  of  D  satisfies  (Dx,y)^  =  (x,  The 

adjoints  D*  and  D'*'  are  related  by 

y}^  =  ('Dx,  wyf  =  (x,  ‘D*wy)  =  (x,  iT>*wy)^  =  (x,  T>+y)^ . 

Thus,  we  have  D"''  =  ^'D*w.  Define  the  operator  T*,  by 

T«,  =  D+D  =  -V*wD, 
w 

where  T^,  is  taken  to  be  self-adjoint,  so  that  formally 

(!Dx,2)y)^  =  (x,T„y)^  =  (T,„x,y)^.  (3.11) 

In  what  follows,  we  assume  that  (3.11)  always  holds. 
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At  this  point,  it  is  beneficial  to  digress  and  look  at  a  few  specific  dif¬ 
ferential  operators  and  their  adjoints.  In  fact,  these  are  the  operators  D 
used  in  most  of  the  numerical  calculations  in  chapter  4.  We  set  to  =  1  and 
T  =  Tto  =  !D*D  for  these  two  examples. 

First,  consider  Vz  =  z'  on  the  domain  [0, 1]  with  the  boundary  conditions 
z(0)  =  z{i)  =  0.  Then  we  have 

{'Dx,y)=  f  x'y  =  xy\  -  f  xy' =  {x,'D*y) , 

Jo  lo  Jq 


so  that  V*z  =  —z'.  Note,  however,  that  functions  z  in  the  domain  of  D*  are 
not  necessarily  subject  to  z(0)  =  z{l)  =  0.  Now  observe  that  with  x  and  y 
in  the  domain  of  D*D,  we  have 


and  (3.11)  is  satisfied. 

Next,  let  Dz  =  z"  on  [0, 1]  with  the  boundary  conditions  5;(0)  =  «(1)  =  0 
and  ^'(0)  =  z'(l)  =  0.  In  this  case,  we  have 


so  ‘D*z  =  z!'.  Again,  the  domain  of  D*  is  not  subject  to  the  boundary 
conditions  imposed  upon  the  domain  of  D.  With  x  and  y  in  the  domain  of 
!D*!D,  we  obtain 


x'y^^^  =  —xy^^^ 

0 

0 


+  f  xy^*^  =  {x,7y) 

Jo 

+  f  x^^^y={7x,y), 

Jo 


and  (3.11)  holds  here  also. 
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To  continue  with  the  discussion,  note  that  the  penalty  functional  cau  now 
be  written  as 

\\Dx\\l  = 

Let  J  denote  the  identity  transformation,  and  define  the  operators 

^  “1“  oc7yj  and  ^ot^w  “ 

Then  we  have 

ll/l£  +  «l|B/l£  =  (/.2«../>,  (3-12) 

where  is  a  self-adjoint  operator.  The  AR  objective  functional  and  its 
first  and  second  Gateaux  derivatives  now  become 

J(/)=  -{fJn)^  +  i{f,Qa,.f)^, 

J' if ){’>')=  -{rJn)rv  +  {r,Q<x,v,f)y,,  and 
J"if)ir^s)  =  {r,Qa,^s)^. 

To  accomplish  the  minimization  of  J{f),  we  let  J'{f)ir)  =  0  for  all  r. 
Thus,  we  obtain  the  weak  differential  equation  representation  of  the  AR 
estimator  as  the  solution  of 


Qa,«,/  =  fn.  (3.13) 

In  terms  of  the  inverse  operator,  the  solution  can  be  written  as 

f  =  ^a,wU  (3.14) 

To  develop  the  generalized  kernel  representation  of  the  AR  estimator,  we 
introduce  the  kernel  Za,w,s,  which  satisfies  the  equation 

^a,v)^a,w,s  ~  (3.15) 

Since  fn  =  „  ]C"=i  ^Xi,  we  can  use  (3.15)  to  write  /„  as 

1  ” 

fn  ~  ^a,wZa,w^i  ~  ^a,w 


1  ” 


L  <=1 
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Then,  by  (3.13),  we  have 


Qa,ii>/  —  Qa,ti 


1  ” 

~  Za,w,Xi 
<=1 


This  yields  the  generalized  kernel  representation  of  the  AR  estimator — 
namely, 


m 


=  ^  E  W  =  f  Zw(<)  (3-16) 

”  i=l  •' 


3.1.2  Special  Cases 

Now  we  take  a  brief  look  at  two  particular  examples.  These  are  interesting 
in  one  respect  because  we  can  perform  the  calculations  and  exhibit  closed- 
form  solutions.  However,  and  more  importantly,  we  see  in  these  cases  that 
the  ARE  is  in  fact  a  kernel  density  estimator  with  a  familiar  kernel.  So, 
the  properties  (including  asymptotic  theory)  of  this  ARE  are  already  well 
investigated. 


3. 1.2.1  Boundary-Corrected  Kernel  Density  Estimator.  In  this 
example,  we  use  the  first-order  differential  penalty  operator  T>x  =  x'  on 
[0, 1]  with  boundary  conditions  a;'(0)  =  x'(l)  =  0.  The  reference  distribu¬ 
tion  is  uniform,  so  If(t)  =  t  and  H'{t)  =  h{t)  =  ^(t)  =  1.  For  notational 
convenience  we  use  as  the  smoothing  parameter,  and  we  write 

in  place  of  The  adjoint  operator  is  T)*x  =  -x'  with  bound¬ 

ary  conditions  a:(0)  =  x(l)  =  0,  and  the  quadratic  form  operator  is  given 
by  Qfif  =  f  —  Then  the  kernel  satisfies  the  weak  differential 

equation 

(3.17) 

subject  to  the  boundary  conditions  Z'^  ^^Q)  =  ^^,,(1)  =  0.  The  solution  is 
seen  to  be 


csch  P  cosh[^(s  A  t)]  cosh[)0(l  —  s  V  i)] 
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for  (s,<)  e  [0,1]  X  [0,1].  We  verify  that  this  is  the  solution.  Let  Z/3^g(t)  = 
U'{t),  where 


u{t)  -  p-^u"{t) 


0,  t^s 

1,  t>s 


subject  to  17(0)  =  U"{0)  =  U"il)  =  0.  With 


U{t)  =  { 


csch  ^  cosh[/?(l  —  s)]  sinh(/3t), 

1  —  csch  /?  cosh(/3s)  sinh[^(l  —  t)]. 


t^s 
t>  s, 


z^At)  =  U'(t)  = 


/?  csch  f3  cosh[)5(l  —  s)]  cosh(/5t), 
csch /3  cosh(/3s)  cosh[j5(l  —  t)], 


t^s 

t>s^  and 


/3^  csch  P  cosh[/3(l  —  s)]  sinh(/3t),  ^  t  ^  s 
—0^  csch  P  cos\i{Ps)  sinh[/3(l  —  t)],  t>  s, 

we  see  that  the  equation  and  boundary  conditions  zire  satisfied.  We  can  use 
the  standard  identities 


=  U"{t)  =  { 


sinh(a;  +  y)  =  sinh  x  cosh  y  +  cosh  x  sinh  y, 
cosh(x  +  y)  =  cosh  x  cosh  y  +  sinh  x  sinh  y,  and 
1  =  cosh^  X  —  sinh^  x 


to  see  that  U  is  continuous  at  s.  Thus,  in  this  case,  the  ARE  is  a  boundary- 
corrected  kernel  density  estimator  with  a  bilateral  exponential  kernel. 


3.1. 2.2  Kernel  Density  Estimator.  The  example  of  the  previous  sec¬ 
tion  can  be  extended  to  the  real  line  as  follows.  We  use  the  same  differential 
penalty  operator  “Dx  —  x'  on  the  interval  [— M,  M].  The  boundary  conditions 
for  2)  are  x'{—M)  =  x'{M)  =  0,  and  the  solution  of  (3.17)  is  then 

Zi3,a{t)  =  P  csch(2/?M)  cosh[j0(s  At  +  M)]  cosh[/3(M  —  s  V  t)] 


on  [— M,M].  Equivalently,  we  can  write 


Zp,${t) 


P  csch(2/3M)  cosh[/3(t  -I-  M)\  cosh[^(M  —  s)], 
P csch(2/3M)  cosh[/0(s  -f  M)]  cosh[P{M  —  t)]. 


-M  <  t  <  s 
s  <t  ^M. 


61 


Since  sinh  x  ~  cosh  x  ~  |e*  as  x  -4  oo,  we  have  fortes 
Z.  (t)  ~  ^  _ 2- _ = 


and  for  t  >  5 


0,IqI3{»+m) ,i  m-t)  0  , 

z.  (t)  ^  ^  2  _ 2^  =  (iel3(»-t) 

"^.<w  2^ 


Putting  this  together  gives 


and  as  M  ->  00  the  AR  density  estimator  f{t)  =  dFn{s)  satisfies 


Z^t)  ~ 


/(t)~/(t)=  r  Kp,,{t)dFn{s). 
J—00 


Of  course,  /(t)  is  the  convolution  kernel  density  estimator  with  a  bilateral 
exponential  kernel.  In  fact,  since 

f  \Z0,B{t)  -  A^,«(t)|  dt  =  2e~^^  cosh /?s, 

J  —OO 

we  can  establish  that 

ll/(<)  -  /(<)IUi  ^  2e"^^  J^  cosh^X,-. 

»=i 

Therefore,  this  AR  estimator  on  [—M,  M]  converges  in  Li  to  a  kernel  density 
estimator  as  M  ->  oo. 


3.1.3  Consistency  and  Rates  of  Convergence 

Consistency  results  and  convergence  rates  are  expressed  in  terms  of  distance 
measures  on  the  space  of  functions  containing  the  true  parameter,  and  the 
estimates.  To  obtain  results,  one  must  typically  use  a  sequence  of  smoothing 
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parameters  a„  satisfying  0  as  n  — >  oo,  where  n  is  the  sample  size.  We 
make  these  dependencies  explicit  by  writing  the  solution  of  (3.7)  as 

/  =  (3-18) 

Then  one  considers  the  convergence  (in  some  sense)  of  fn,a„,w  to  /<>  as  n  ->  oo. 

On  the  other  hand,  for  fixed  n  and  sufficiently  smooth  p.d.f.  /„,o,  we 
have  the  recursive  AR  estimator  sequence  {fn,o,  fn,i,  fn,u  ■  ■  ■)  characterized 
by  (3.8) — namely, 

fn,i+l  ~  fn,an,v>ii  where  Wi  =  Xf (3.19) 

In  the  following  subsections  we  consider  the  properties  of  a  single-stage 
AR  estimator  (3.18),  which  means  that  w  is  taken  to  be  fixed.  We  are 
not  concerned  at  the  present  time  with  the  properties  of  the  recursive  ARE 
sequence  (3.19). 

We  can  approach  these  problems  on  a  case-by-case  basis,  considering 
individual  (differential  operator  and  boundary  condition)  configurations  as  in 
section  3.1.2.  Alternatively,  we  can  adopt  a  more  general  approach.  At  least 
two  essentially  different  general  approaches  to  similar  problems  appear  in 
the  literature.  One  is  Bosq  and  Lecoutre’s  [6]  probability-theoretic  analysis 
of  generalized  kernel  density  estimators.  Another  is  the  spectral  anal3rsis 
method  used  by  Silverman,  Cox,  O’Sullivan,  and  Wahba.  (See  the  references 
on  page  53.)  We  now  apply  each  of  these  techniques  in  an  attempt  to  obtain 
general  results  about  the  consistency  and  rates  of  convergence  of  the  single- 
step  AR  density  estimator. 


3.1.3.1  Generalized  Kernel  Analysis.  For  an  AR  estimator  fn,an,v>  of 
/o  with  domain  J,  we  define  the  pointwise  distance  measure 


Dn{t)  —  fn,an,vi{i)  fo{t)  j 


tel. 


For  G  C  J,  we  use  the  restricted  global  distance  measure 


dG{9,f)  =  s\ip\g{t)- 
t^G 


63 


Now  consider  the  AR  density  estimator  constructed  with  a  fixed  (initial 
guess)  weight  function  w.  Prom  section  3.1.1,  we  have  the  generalized  kernel 
density  estimator  representation 


fn,an,w{i)  —  ^  ^ 

Ji 


where  the  kernel  Za,w,$  satisfies  the  equation 


The  generalized  kernel  density  estimator  has  been  analyzed  by  Bosq  and 
Lecoutre  [6].  We  apply  their  theorem  in  the  case  of  a  fixed  weight  w  and 
state  the  result  here. 

Let  I  =  [0,1].  We  consider  the  measure  space  (J,  ®,A),  where  A  is  the 
Lebesgue  measure  on  I  and  3  is  the  Borel  <T-algebra.  The  space  T  is  the 
functional  domain  of  the  AR  estimation  problem.  For  G  C  I,  define  the 
space  3^0  by 


3g  =  \  f  '•  f  E  sup 


[(3!o,.-3)/lW|=o}. 


The  intent  is  that  3g  consists  of  functions  /  that  are  sufficiently  well-behaved 
on  G  so  that  0ia,wf  /  in  the  manner  indicated  as  a  -)■  0. 

The  five  conditions  that  follow  are  part  of  the  necessary  and  sufficient 
criteria  for  Bosq  and  Lecoutre’s  results  on  arbitrary  kernels.  Our  kernels 
arise  ais  solutions  of  certain  linear  diflferential  equations,  which  should  imply 
that  the  first  four  conditions  hold.  The  conditions  can  be  verified  on  a  case- 
by-case  basis  for  individual  combinations  of  differential  operator,  boundary 
conditions,  and  domain.  Loosely  speaking,  conditions  (1)  and  (2)  are  bounds 
on  the  supremum  and  L2  norm  of  Z,  respectively,  and  conditions  (3)  and  (4) 
relate  to  the  continuity  of  Za,w,x^)  in  x  and  i,  respectively.  Condition  (5) 
involves  the  metric  structure  of  I  and  G,  and  holds  when  G  is  an  interval. 
The  conditions  follow. 

Suppose  that  there  is  a  positive  constant  and: 
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(1)  There  is  a  bounded  function  ^4  :  G  — K,  and  for  each  t  £  G  there  is 
an  tto  such  that  if  a  <  cto  then 

sup  l^a, «,*(<)!  ^  a~^A{t). 

*€/ 

(2)  There  is  a  bounded  function  B  :  G  — f  R,  and  for  each  t  £  G  there  is 
an  tto  such  that  if  a  <  «„  then 

(3)  There  are  a  function  Qo  £  Jg,  a  point  to  €  G,  and  a  family  {®a  :  a  >  0} 
of  Borel  subsets  of  I  such  that 


Qoito)  ^  0, 

A(®a)  =  ko^  for  A:  >  0,  and 

/  Za,w,x%)9o{x)  dx'^  S  for  S>  0. 

J'Ra. 

(4)  (J,  !B)  is  a  metric  space  with  Borel  a-algebra  S,  and  Za,w,x  satisfies  the 
Lipschitz  condition 

sup  -  Za,w,xit)\  <  Ca~”^\t  -  t'\'^ 

xei 

for  some  G  >  0,  m  >  0,  and  7  >  0. 

(5)  (/,  3)  is  a  metric  space  with  Borel  a-algebra  3,  and  G  €  3  is  precom¬ 
pact  (i.e.,  has  compact  closure).  0  <  A(G)  <  00,  and  there  is  an  h  >  0 
such  that  for  all  e  small  enough,  G  can  be  covered  by  [e~*]  balls  of 
radius  ^e. 

Next  is  a  condition  on  the  convergence  of  the  smoothing  parameter  se¬ 
quence.  The  sequence  an  is  called  asymptotically  concave  if  there  are  a  con¬ 
cave  function  positive  constants  Ci  and  C2,  and  an  integer  such  that 

cip(n)  <  a„  <  C2p(n),  Vn  >  Uo- 
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The  mode  of  convergence  used  in  Theorem  3.6  is  almost  complete  conver¬ 
gence,  denoted  X  and  defined  by 

OO 

X„^X^We>0,  '^Pv[diXn,X)  ^  e]  <  OO. 

n=l 

Almost  complete  convergence  implies  almost  sure  convergence,  denoted 
X„^X  and  defined  by 

Xn^X^  Pr[d(X„,X)  -)■  0]  =  1. 

In  turn,  almost  sure  convergence  implies  convergence  in  probability,  denoted 
Xn-^X  and  defined  by 

Xn  Ve  >  0,  Pr[d(X„,X)  ^  e]  ->  0. 

Finally,  we  state  the  results  of  Bosq  and  Lecoutre.  The  theorem  guarantees 
strong  uniform  consistency  of  the  generalized  kernel  density  estimator.  The 
corollary  establishes  the  rate  of  convergence. 

Theorem  3.6.  If  the  preceding  conditions  (1)  through  (5)  hold  and  a~^  is 
asymptotically  concave,  then  the  following  conditions  are  equivalent. 

(1)  n~^Q.~^  log  n  0. 

(2)  Dn{t)  ^  0,  ED„(t)  ^  0;  t  €  G,  fo  G  Jg- 

(3)  dG(/n, /o)  ^  0,  EdG(/«.a„,«„/o)  0/  fo  G  Tg- 

Corollary  3.7.  Under  the  preceding  conditions,  there  is  a  6  >  Q  such  that 
for  all  n  large  enough  and  for  alle>  0 

Pr[dG(/n,a„,ti>, /o)  >£]<  2exp{-5£^no^). 

As  a  consequence  of  the  corollary,  if  we  choose 

e„  =  A  •  (logn)^/^n~^/^a“^/^. 


66 


we  then  obtain 


Pr  [(logn)  daif n, a„,w,  fo)  ^  aJ  <  2exp(-JA^logn)  =  2n 

for  any  A  >  0.  Then  the  corresponding  rate  of  convergence  in  probability  is 

(logn)~^/V/2a!^/2dG(/n,a„,»«,/o) -^0  as  n^oo. 

We  can  show  that  jS  =  (2p)~^,  where  p  is  the  order  of  the  differential  penal¬ 
ization  operator.  The  convergence  conditions  are  then 

|2p 


ttn  ->  0  and  Q!„ 


n 


[lognj 


-4  00  as  n-¥  oo. 


Thus,  the  rate  for  convergence  in  probability  becomes 

(logn)“^/V/2ay^PdG(/n,Q„,«;,/o) -^0  as  n^oo, 
and  choosing  a„  =  for  some  7  with  0  <  7  <  1  results  in 
(logn)“^/2n(^"'>')/2  daif n,a„, w,  fo)  -^0  as  n  ->  00. 


3.1. 3.2  Spectral  Analysis.  Here,  /<,  is  the  true  parameter  value.  We 
analyze  the  estimator  /  =  %x,wfn  by  characterizing  the  error  as 

f  fo  —  (,f  •f^,vifo')  {^a,v)fo  /o)j 

interpreting  Oia,wfo  as  the  asymptotic  “infinite  sample  size”  solution  of  the 
estimation  problem.  In  any  norm  ||•||,  we  can  bound  the  error  of  the  estimator 
by  means  of 

III/  -  /of  <  11/  -  3!«,»/of  +  l|3!„,»/o  -  /of  =  Ti  +  T2, 

in  which 


Ti  =  11/ -  -  Ell/  -  3!„,„/,f  and 

li  =  E 11/  -  5!„,„/,f  +  ||3!.,„/„  -  /.f . 
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Since  ETi  =0,  it  follows  that  E  ||/— <  2T2.  The  “mean”  error  T2  is  the 
sum  of  a  “variance”  term  E  ||/  —  0la,wfo\\^  and  a  “bias”  term  —  /©IP- 

It  is  shown  in  Theorem  3.8  that  E/  =  ^a,wfo-  We  now  characterize  the 
behavior  of  T2  to  obtain  rates  of  convergence  for  E  ||/  —  /o|p. 

To  that  end,  let  Af  and  ipi  be  the  eigenvalues  and  X2,to-orthonormal  eigen¬ 
functions  of  Qi,u„  so  that 

=  ArVi,  and  =  Sij. 

Recall  that  Tw  =  ^'D*wT>,  where  2)  is  a  linear  differential  operator  of  order 
p  with  no  constant  term.  Note  that  we  are  using  the  operator  Qi,^.  =  2  +  7^,, 
and  that 

Qa.to  ~  (1  ~  0^)2  +  OlQl.w  =  2  -f  Q!(Qi,tu  —  2). 

We  use  Qi^u,  in  the  analysis  to  isolate  the  smoothing  parameter  a.  By  equa¬ 
tion  (3.12),  we  have 


for  all  /.  Therefore,  Qi,u,  is  a  positive  operator  and  all  eigenvalues  are  pos¬ 
itive.  With  =  constant,  we  see  that  Qi.wV’i  =  so  the  constant 

function  tpi  is  an  eigenfunction  with  eigenvalue  Aj  =  1.  Furthermore,  since 
we  have  on  one  hand 

<*.  =  Willi  +  l|B*lll  =  1  +  l|B*lll 


and  on  the  other  hand 


{iph  Ql,w1pi)y,  =  (i’i,  \  =  Ai  ^  (V'i,  \  ^ll^f  11^  =  Ai  \ 


it  follows  that 

A  =— ^ 

Therefore,  the  eigenvalues  are  all  positive,  and  they  satisfy 

1  =  Ai  ^  A2  ^  A3  ^  . 
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Also,  since  7^  is  a  linear  differential  operator  of  order  2p,  one  can  show  that 
there  are  constants  a  and  h  with 


0  <  o  ^  Ai  ^  6  <  00, 


so  that  the  eigenvalues  Af  decay  like  For  reference,  see  Riesz  and 

Nagy  [58],  p.  238,  and  Silverman  [66],  Lemma  5.1. 

The  are  also  L2,w-orthonormal  eigenfunctions  for  Q,a,w  and  3ta,w  These 
operators  act  on  the  eigenfunctions  as  follows. 


Qa,w^i  =  7i  Vi  and  Ola,wi/’i  -  7i^i, 


where 


^  =  (1  -  a)  +  aA^  ^  =  1  +  a(A<  ^  -  1). 
General  eigenfunction  expansions  are  then 


00 


^  ^i)w  V'i,  Ql.ti;®  = 

<=1  i=l 

00  00 

Qa,wX  =  7r^  i’i)w  >  and  Olc,,wX  =  53  Ti  {x,  i’i  • 

i=l  »=1 

For  the  analysis  of  AR  estimators,  relevant  eigenfunction  expansions  are 

00  oo  oo 

fo  ^  ^  ^  Ola^yjfo  =  ^  ^  f  =  ^  ^ 


*=1 
oo 


t=:l 


t=l 


f  ^a^wfo  ^  End  ^a,ti?/o  fo  —  ^ 

*=1  t=l 

where 

Ci  =  {fo,'tpi)y,  =  j  folpiW  =  j  ipiWdFo  and 

A  =  </n,V’i)w  =  J  MiW  =  j  IpiWdFn. 

Now,  note  that  E^i  =  Cj.  This  implies  E/  =  3la,u>/o)  which  in  turn  estab¬ 
lishes: 
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Theorem  3.8.  For  any  weight  w,  the  AR  density  estimator  f  =  Oia,wfn  is 
an  unbiased  estimator  of  ^a,vifo,  where  fo  is  the  true  parameter  value. 

The  condition  ||Qi,ti;/||2;  <  oo  is  equivalent  to  Wahba’s  [78]  “very  smooth.” 
We  refer  to  the  stronger  condition  ||Qi,u;/||i,«,  <  oo  as  “way  smooth.”  Con¬ 
vergence  of  the  bias  Ola,wfo  —  fo  is  provided  by: 

Lemma  3.9.  Suppose  that  ||Qi,u,/o||^  <  oo.  Then 

- /»l£  =0(a“)  and 
113!.,./.  - /.II?,  =  0(a), 

If  in  addition  ||Qi,t«/o|li,tu  < 

\\^a,wfo  -  fo\\l,w  =  O  (a^) . 

We  can  also  provide  rates  of  convergence  for  the  error  variance  term 
/  -  %x,wfo‘  Note  that  the  rates  of  convergence  given  in  Lemma  3.10  apply 
when  the  weight  function  has  the  “correct”  value  of  w  =  I//©. 

Lemma  3.10.  Suppose  that  w  =  !//<,  and  ||Qi,to/o||w  <  oo.  Then 

E 11/  -  3!«,./.|£  =  O  and 

E 11/  -  3t.,/o||?,„  =  O  (n-'a-'-'/^'’)  . 

Combining  Lemma  3.9  and  Lemma  3.10  establishes 

Theorem  3.11.  Suppose  that  w  =  !//<,  and  HQi.to/ollw  o®-  Then 

El|/-/.l£  =  O  {n~'a~^f^'' +  c?)  and 

E||/-/.llf,  =  0(n-‘a-'-‘/*P  +  a). 

If  in  addition  llQi.ui/olli.w  <  oo,  then 

E||/-/.ll?,  =  0(n->a-'-‘/^  +  o“). 
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This  result  implies  the  following  statements  about  rates  of  convergence 
under  various  conditions.  In  the  norm  II -lU  convergence  is  obtained  for  any 
sequence  a„  with 

Qin  — 0  and  oo  as  n  — ^  oo. 

In  particular,  with 

a„  -  =  „-i/2+i/(8p+2)^ 

the  best  rate  of  convergence  provided  by  Theorem  3.11  is 

E 11/  -  foWl  =  O  =  O  (n-i+i/(4p+i)) . 

In  the  norm  ||  •  convergence  is  obtained  for  any  sequence  with 
ttn  — ^  0  and  — )■  oo  as  n-^  oo, 

and  when 

a„  ~  „-2p/(4p+l)  ^  ^-l/2+l/(8p+2) 

the  best  rate  of  convergence  provided  by  Theorem  3.11  is 

E 11/  -  /o|li«,  =  O  =  O  (n-V2+i/(8p+2))  ^ 

Suppose  additionally  that  /<,  is  way  smooth  and- 

^  „-2p/(6p+l)  ^  ^-l/3+l/(18p+3)_ 

Then  the  best  rate  of  convergence  provided  by  Theorem  3.11  is 

E 11/  -  foWU  =  O  =  O  (n-2/3+2/(i8p+3))  _ 

It  remains  to  characterize  the  random  error  component  \\f—^a,wfo\\^  in  order 
to  obtain  convergence  rates  for  ||/—  /o|p  0. 
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3.2  Inverse  Problems 

In  this  section,  we  discuss  a  generalization  of  density  estimation  that  in¬ 
volves  indirectly  observed  data.  We  present  two  examples,  the  deconvolution 
problem  and  Wicksell’s  corpuscle  problem. 

Let  ,Xn  be  i.i.d.  with  unknown  p.d.f.  /<,,  which  we  wish  to  es¬ 

timate.  Suppose  that  the  Xj’s  are  not  directly  observable,  but  that  we  do 
observe  the  “transformed”  i.i.d.  data  Zi, . . .  ,  where  the  common  p.d.f.  of 
the  ZiS  is 

9o{t)  =  [0Cfo]{t) 

for  some  known  operator  DC.  Based  on  the  data  Zi, ,  Zn,  the  AR  functional 
for  estimation  of  Qo  is 

Jig)  =  -(Pn,^)„  +  |ll5||^ 

The  “correct”  weight  is  w  =  l/g,  and  gn  =  G'^  where  the  empirical  c.d.f.  Gn 
is  based  on  the  observable  data  Zi,...  , Zn- 

Since  g  =  DC/,  we  estimate  /  using  the  AR  objective  functional 

Jif)  =  -{gn,OCf)^  +  im\\l 
For  nonparametric  estimation,  we  penalize  /  and  use 

Jif)  =  -  {9n,0Cf)^  +  imwl  +  f  I|2)/1IL 

where  the  “correct”  weight  is  it;  =  1/DC/.  The  penalty  operator  has  the 
usual  characteristics:  D  is  a  linear  differential  operator  of  order  p  ^  1,  with 
no  constant  term,  defined  on  a  suitable  domain  with  appropriate  boundary 
conditions. 

Let  (DC'/)(r)  denote  the  Gateaux  derivative  of  DC  at  /  in  the  direction  r. 
Then  the  Gateaux  derivative  of  J  at  /  in  the  direction  r  is 

J'{/)W  =  -  (3C7)(r)).  +  {Xf,  (3C7)(r)).  +  a  (B/,  Br)„ 

=  -  (wg„,  {K'/)(r))  +  {wXf,  (rf)(r))  +  a  (wB/,Br> 

=  -  {(X'fyw9„,  r)  +  ((X'fywXf,  r>  +  a  {Tl'wVf,  r) , 
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and  the  differential  equation  for  the  estimator  /  of  /„  is 

{[(X'fywX  +  aT)*w^]  f)  (t)  =  [iX'fYwgn]  (t)  (3.20) 

or,  more  briefly, 

[(X'/YwX  +  a‘D*w'D]  f  =  {X'fYwgn. 

This  applies  when  X  is  an  arbitrary  (linear  or  nonlinear)  operator. 

If  3C  is  a  linear  operator,  then  X'f  =  X.  In  this  case,  the  differential 
equation  simplifies  to 

[X^wX  +  a'D*wV]f  =  X*wgn.  (3.21) 

It  is  possible  to  conduct  a  spectral  analysis  of  the  linear  inverse  problem  us¬ 
ing  the  technique  of  “simultaneous  diagonalization”  for  infinite-dimensional 
operators  and,  thereby,  obtain  general  results.  See  Cox  [9]  and  Cox  and 
O’Sullivan  [10]  for  work  of  this  nature.  We  do  not  pursue  this  analysis  here. 


3.2.1  Deconvolution 
Consider  the  model 

X  =  Z  +  W 

where  the  random  variables  have  densities 


X  go,  Z  fo,  and  W  r^k. 

We  assume  that  Z  and  W  are  independent  and  that  A:  is  a  known  continuous 
density.  The  parameter  of  interest  is  the  density  fo  of  Z.  However,  we 
observe  X  and  not  Z.  The  cumulative  distribution  of  X  is 

/oo  pt~-W 

I  fo{z)k{w)dzdw, 

OO  Joo 

so  X  has  density 

/OO  poo 

fo{t  -  w)k{w)  dw=  I  k(t  -  x)fo{x)  dx. 

■OO  J—oo 
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This  is  called  the  convolution  of  k  and  /<,,  and  denoted 


[k  *  fo](t)  =  (  k{t-  x)fo{x)  dx. 

J  —00 


So,  in  the  context  of  section  3.2,  we  have  [X/o](<)  =  [k*  /o](<).  Since  !JC  is  a 
linear  operator,  the  equation  for  the  AR  estimator  /  of  /<,  is  given  by  (3.21). 

This  problem  can  be  analyzed  directly  for  a  specific  contaminating  distri¬ 
bution  k,  at  an  intermediate  level  of  generalization  with  X  being  the  convo¬ 
lution  operator  for  an  arbitrary  k,  or  through  the  general  methods  referenced 
at  the  end  of  section  3.2.  We  pursue  no  further  analysis  here,  but  in  chapter  4 
we  do  present  some  example  calculations  and  a  simulation  study  that  com¬ 
pares  the  nonparametric  AR  deconvolution  estimator  to  other  deconvolution 
estimators. 


3.2.2  The  Corpuscle  Problem 


Imagine  a  solid  medium  in  which  spheres  occur  according  to  a  homogeneous 
Poisson  process  with  unknown  rate  A.  The  sphere  radius  is  the  random 
variable  of  interest,  with  p.d.f.  which  has  support  [0,  Rm]  where  Rm  >  0. 
The  radii  are  not  observed  directly.  A  planar  slice  is  taken  through  the 
medium,  and  we  observe  the  resulting  radii  of  circles  that  are  the  intersection 
of  the  plane  and  certain  spheres.  Of  course,  the  slice  misses  some  spheres 
completely  and  cuts  the  others  at  unknown  latitudes.  The  p.d.f.  go  of  the 
observable  circle  radii  is  calculated  in  the  following  manner. 

Let  the  sphere  radius  R  have  p.d.f.  foI[o,RMV  Independently  of  R,  let 
the  random  variable  Y  have  the  uniform  distribution  on  {—Rm,  Rm)-  So  the 
p.d.f.  of  y  is 

^  T 

OD 

ZKm 

and  y  represents  the  sphere  coordinate  at  which  the  slice  is  taken.  Let  the 
random  variable  5  be  given  by 


S  = 


1,  |r|<ie 

0,  \Y\>R, 
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so  that  (5  =  1  if  and  only  if  the  sphere  is  sliced,  in  which  event  + 
where  D  is  the  circle  radius.  The  c.d.f.  of  circle  radius  is  denoted  by  Go,  and 
so 


1  _  Go(t)  =  Pr(I>  >  t\5  =  1)  = 
The  denominator  is 

fO 


Pt{D  >  t  and  <5  =  1) 
Pr((5  =  1)  ■ 


nr  1  1  pRM  1 

dydr  =  —  rfo{r)  dr  =—ER. 

r  Jo 


Since  D  >  t  and  <5  =  1  together  imply  that  both  t  <  R  <  Rm  and  < 
R?  we  calculate  the  numerator  as 


fRu  r  I  1  /**"  , - 

/  /  dy  fo{r)  dr^—  Vr^  -  t^fo{r)  dr 

Jt  J _  Jt 


-Vr^ 


Then  the  circle  radius  c.d.f.  is 


Goit)  =  1  ^  -  t^foir)  dr, 


and  the  p.d.f.  is 


So  we  can  write 


erI 


1  t 


9o  ~  ^foi 


fo{r)  dr. 


where  the  operator  is  given  by 


Since  %  is  a  nonlinear  operator,  the  differential  equation  for  the  AR  estimate 
/  of  fo  is  given  by  (3.20).  There  is  a  demonstration  of  the  nonparametric 
AR  corpuscle  problem  estimator  in  chapter  4,  where  we  also  show  that  the 
problem  is  essentially  linear. 
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3.3  Poisson  Process  Intensity  Estimation 

In  this  section,  we  discuss  the  application  of  AR  estimation  to  the  Poisson 
process.  Due  to  the  close  relationship  between  density  estimation  and  Poisson 
process  intensity  estimation,  previous  results  apply  here  with  only  slight 
modification. 

With  h  fixed,  the  AR  functional  for  estimation  of  the  intensity  =  G'„ 
of  a  Poisson  counting  process  N  on  [0, 1]  is 


The  natural  constrained  optimization  problem  is  then 

minimize  Jh{g)  subject  to  p  €  T2  H  6,  (3.23) 

where  the  constraint  set  is  6  =  ^  ^  0}.  We  can  construct  a  sequence 

{^m}TO=i  in  2^2  n  e  with  the  property  that  J{gm)  ->  -00  as  m  ->  00,  thereby 
showing  that  J  is  unbounded.  As  a  result,  the  optimization  problem  has 
no  solution  in  L2.  In  fact,  Gm{t)  =  f^gmiv^du  converges  in  L2  to  iV,  the 
sample  path  of  the  counting  process.  Details  are  in  the  proof  of: 

Theorem  3.12.  Problem  (3.23)  has  no  solution. 

As  in  the  case  of  density  estimation,  we  can  penalize  the  objective  to 
obtain  a  related  problem  that  does  have  a  useful  solution.  Specifically,  the  - 
penalized  problem  is 


minimize  Jhig) 
9 


subject  to  g  eOipDQ, 


(3.24) 


where 


Jh{9)  =  ^j  f^dN  + 


1  ,  a  f  {‘Dg? 

2]  h  2]  h 


for  some  a  >  0.  As  in  the  case  of  density  estimation,  D  is  a  linear  differential 
operator  of  order  p  ^  1.  Of  course,  D  has  no  constant  term  and  is  defined 
on  a  suitable  domain.  We  formally  state: 
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Theorem  3.13.  Problem  (3.24)  has  a  unique  solution. 

If  the  differential  operator  D  has  order  p,  then  the  problem  has  a  unique 
solution  with  p  -  1  continuous  derivatives.  Thus,  one  may  choose  the  penal¬ 
ization  based  on  which  function  (derivative  of  Go)  one  wishes  to  estimate. 

As  in  the  parametric  case,  we  have  a  theorem  characterizing  the  limit  of 
the  estimator  sequence. 

Theorem  3.14.  With  n  fixed,  if  the  AR  estimator  sequence  associated  with 
problem  (3.24)  converges,  its  limit  is  a  maximum  penalized  likelihood  estima¬ 
tor. 


3.3.1  Representation  of  the  Intensity  Estimator 

In  this  section,  we  give  a  characterization  of  the  AR  intensity  estimator. 
The  situation  is  analogous  to  the  density  estimation  case  presented  in  sec¬ 
tion  3.1.1,  so  the  discussion  is  brief. 

The  intensity  estimator  g  is  the  solution  of 

=  0  for  re  0<p, 


and  so  the  estimator  satisfies  the  differential  equation 

^3  +  oh  •  g  =  dN. 

Here,  dN  =  ^t,.,  where  St  is  the  point  mass  at  t.  We  can  write 

9ii)  =  ^2  =  /  9(.u)  du  =  'f2  Su  (<). 

i=i  •'0  <=i 


where  —St{u)  =  sJu),  and  observe  that  s*  satisfies 
au 


<5*. 
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3.3.2  Special  Case 


We  can  compute  the  solution  of  the  AR  intensity  estimation  problem  in 
this  special  case,  which  is  the  Poisson  process  analogue  of  the  problem  of 
section  3. 1.2.1. 

With  Dx  =  x'  and  h  =  1,  we  can  derive  a  closed-form  solution  for  the 
unconstrained  version  of  problem  (3.24).  We  impose  the  boundary  conditions 
s'(0)  =  sj(l)  =  0  to  specify  a  solution.  We  see  that  the  solution  does  in  fact 
satisfy  the  constraints.  Since  H{t)  =  t,  h{t)  =  1,  and  h'{t)  =  0,  we  have 

St  -  as"  =  St. 


Integration  yields 

St{u)-aS'^ 

with  boundary  conditions  St(0)  =  iS"(0)  =  ^"(l)  =  0,  for  St  €  IK2.  As  in 
section  3. 1.2.1,  the  solution  is 


(u)  = 


5.W 


f 

csch  P  cosh[i3(l  - 1)]  sinh(;i3u),  u<t 
1  —  csch  13  cosh(/3t)  sinh[j3(l  —  u)],  u'^t, 


where  Also,  note  that  S't(l)  =  1  for  alH  €  [0, 1].  The  nonpara- 

metric  compensator  estimate  is  then 


G{u)  =  '^StXu). 

»=i 

This  establishes  existence  and  uniqueness  of  the  solution,  and  the  form 
of  the  unconstrained  solution.  Since 

f/3cschj0cosh[^(l  —  t)]cosh(/?u),  u<t 
St{u)  =  St{u)  =  < 

1/0  csch  13  cosh(^t)  cosh[/0(l  -  u)],  u'^t, 

we  see  that  St{u)  ^  0  for  all  t  and  u  G  [0,1];  therefore,  the  unconstrained 
intensity  estimator  g  =  G'  satisfies  g'^Q.  So  we  have  solved  the  constrained 
optimization  problem.  We  recognize  the  solution  ^  as  a  (boundary-corrected) 
bilateral  exponential  kernel  intensity  estimator. 
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3.4  Proofs 


Proof  of  Theorem  3.1.  Let  ti, . . .  ,tn  be  ordered  distinct  data  in  (0, 1). 

Suppose  that  h{ti)  >  0  for  all  i.  This  is  reasonable  because  the  “correct” 
value  of  h  is  /<,,  which  is  necessarily  positive  at  each  t,-.  Furthermore,  we 
suppose  that  h  is  continuous  since  we  axe  primarily  interested  in  smooth 
(continuous  or  even  differentiable)  estimates  of  the  density. 

For  m  large  enough  so  that  ti,  U  —  tj-i,  and  1  —  tn  are  all  greater  than 
l/m,  let 


9m{t)  =  I 


(3A:€{l,...,n}) 
0,  (Vfc  €  {1,...  ,n}) 


Clearly,  €  Lj  D  6  for  all  m  large  enough.  Note  that 


„  2  n 

**~2m 


dt. 


Since 


*«+5S 


/ 


2m 


m  1 

dt  — )•  TTr-r  as  m  -¥  oo, 


h{t)  h{ti) 


it  follows  that 


We  have  established  that  Jn,h{9m)  —oo  as  m  oo.  Thus,  J  is  unbounded 
below  on  L2  H  C,  and  the  optimization  problem  (3.1)  has  no  solution.  □ 

Proof  of  Lemma  3.2.  The  equivalence  of  ||  •  ||p,i  and  ||  •  |lp,p  is  a  Special  case 
of  Corollary  4.16  of  Adams  [1].  For  the  equivalence  of  ||  •  ||pj,  and  ||  •  ||p,®,  see 
Silverman  [66].  □ 

Proof  of  Theorem  3.3.  See  Thompson  and  Tapia  [73].  □ 
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Proof  of  Corollary  3.4.  We  verify  the  hypotheses  of  Theorem  3.3  in  the 
context  of  problem  (3.6),  where  the  Hilbert  space  is  Oi  =  IKp,  the  constraint 
set  is 

e  =  {/  €  3{p  I  /  >  0  and  //  =  1}  , 
and  the  functional  is 


,  a  f  wr 
2j  h  ■ 


The  constraint  set  C  is  clearly  convex.  To  see  that  6  is  closed,  consider 
a  sequence  gn  in  C  and  a  point  g  with  ||^n  —  g\\  ->  0.  Since  g  and  each 
of  the  gn  are  absolutely  continuous,  we  have  gn-^  g  pointwise,  and  therefore 
g  ^  0.  Writing  g  =  g^-  (gn  -  g),  we  have  /g  =  1  -  /(gn  -  g).  Since 

I  Sidn  -9)\<.  5  \9n  -9\<:  y/S\9n-9\^  <  l|pn  “  9\\,  by  Jensen’s  inequality, 
we  have  /  g  =  1.  Therefore,  6  is  closed. 

Again,  since  ||/  —  g||  0  implies  f  9  pointwise,  each  map  /  i-4 

f{t)/h(t)  is  continuous.  The  definition  of  the  norm,  along  with  the  norm 
equivalences  of  Lemma  3.2,  implies  that  /  ^-^  /  P/h  and  f  ^  f  {Vf)^/h  are 
continuous.  Therefore,  J  is  continuous. 

The  first  Gateaux  derivative  of  J(/)  in  the  direction  r  is 


rV)M  =  -/^dF„  +  /f+a/ 


D/  Dr 


The  second  Gateaux  derivative  of  J{f)  in  the  directions  r  and  s  is 

The  cone  tangent  to  6  at  /  is  defined  as 


T{f)  =  e  IK :  3t  >  0,  such  that  f  +  tT]€  6}, 

and,  by  definition,  J"  is  uniformly  positive  definite  in  6  if  there  is  a  fc  >  0 
such  that  for  each  /  €  6 
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In  fact,  by  the  norm  equivalences  of  Lemma  3.2,  we  see  that  J"  is  uniformly 
positive  definite  on  all  of  □ 


Proof  of  Theorem  3.5.  The  negative  penalized  loglikelihood  for  density 
estimation  is 

i(/)  =  -  j\oiSdF„  +  cwU) 

for  a  suitable  penalty  functional  V.  Its  Gateaux  derivative  is 

=  -  J  jdFn  +  au'{f){r). 

We  express  the  constraint  /  /  =  1  as  the  functional  relationship  Tf  =  0  and 
note  that  the  derivative  of  T  is 


’'(/)(r)  =  / . 


By  the  method  of  Lagrange  multipliers  for  infinite-dimensional  constrained 
optimization  problems,  the  penalized  likelihood  estimator  /  satisfies  for  all 
r  and  some  real  A 

L'(/)(r)  +  Ar(/){r)  =  0, 

which  is  to  say 

-  y  y  <iF„  +  a/ r  +  ai/(/)(r)  =  0. 

See  Luenberger  [46]  for  a  discussion  of  the  Lagrange  multiplier  method  in 
infinite-dimensional  spaces.  Now  consider  minimization  of  the  AR  objective 
functional 


Differentiating,  we  have 


h  ij 

r  mr 

h  ' 

fr 

T  +  “j 

fVfVr 

h 

and  the  constrained  solution  /  satisfies  for  all  r  and  some  real  /x 


JV){r)  +  nT'{f){r)  =  0, 
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which  is 


Vf  Dr 


Upon  convergence,  we  have  f  =  h,  so  the  relationship  is 

/•_  .  .  r7>fDr_ 


- 1  ydF„  +  (,l+^)jT  +  aJ‘^^^  =  0. 

/Df  Dr 

— j —  completes  the  proof.  □ 

Proof  of  Theorem  3.6.  See  Bosq  and  Lecoutre  [6].  □ 

Proof  of  Corollary  3.7.  See  Bosq  and  Lecoutre  [6].  □ 

Proof  of  Lemma  3.9.  See  the  proof  of  Theorem  3.11.  □ 

Proof  of  Lemma  3.10.  See  the  proof  of  Theorem  3.11.  □ 

Proof  of  Theorem  3.11.  The  relevant  norms  expressed  in  terms  of  eigen¬ 
function  expansions  are 

il/X  =  E4.  IIQi../.£  =  E\-V,  = 

i=l  i=l  i=l 

00  00 

11/  -  %x,wfo\\l  =  -  Ci)^  and  Wfo  -  X,,v,fo\\l  =  5^(1  -  JiTcl 

»=i  »=i 

Then  the  error  of  the  estimator  can  be  bounded  by 

ill/  -  /.Hi  <  11/  -  K«,./.lli,  +  IK,/.  -  Mi 

=  £  ifw  -  +  £(1  - 

*=1  »=1 

To  compute  the  expected  value,  note  that  the  expectation  of  the  first  sum 
is  a  (weighted)  sum  of  the  Vary9<.  Let  <t?-  =  J  dFo.  It  is  straightfor¬ 
ward  to  compute  Cov{^i,fij)  =  —  CjCj),  so  that  Var^i  =  — 

We  then  have  the  expectation 

E 11/ -  Var A  = 
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which  gives  a  bound  on  the  expected  error. 


2 

W 


5  E 11/  -  /.|£  <  E 11/  -  -  fX 

=  ^E'rf(‘4-‘^)  +  E(1  -  7if 4  =  n-‘Si  +  (3-25) 

Tl.  . 

»=1  t=l 

In  the  analysis  of  we  confine  our  attention  to  the  case  of  h  =  /©,  or 
w  =  1/fo.  Then  we  have  <r?-  =  J  ‘^iipjW  —  Sij  and  Var^j  =  n“^(l— cf )  ^  n~^, 
so  that 

We  approximate  with  an  integral  in  the  manner  of  Silverman  [66]  or 
Wahba  [78]. 


Si 


1 _  r°° _ 1 

-  CuY  Jo  (1  +  d 


h 


where  6  =  a/(l  —  a).  We  use  the  change  of  variables 
1 


1  + 

to  obtain 

Therefore, 


X 


,  zl-V2P(l  -  ^)l/2P-l 
and  dx  = - _  „ - dz 


ez  )  ’ 

^  _V - 2p>  2p)  ^  Qf^-ll2p^ 

^  2p6^l'^p  ^ 

n~^Si  =  O  . 


2p  e^i^p  ^2 


We  can  obtain  a  bound  for  S2  by  making  use  of  the  smoothness  condition 
||Qi,u>/||«  =  Ei^i  <  oo-  Observe  that 


00  00  ^2/\  1^-2^2 


i=l 


\“2^2  00  \-2^2 

^(1-a  +  aAr^f  + 
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and  that 


as  «\0,  Vi. 


(1  +  «V) 

Hence,  by  the  dominated  convergence  theorem  (the  dominator  is  also  the 
limit  here), 

S2 ->  «*||Qi,„/|£  as  e-»o. 

Therefore 

82  =  0  (a^) . 

We  thus  obtain  the  asymptotic  rate  of  convergence  for  expected  error 
E 11/  -  foWl  =  O  ^  ^2)  ^ 

In  the  native  Sobolev  norm  given  by 

NIL  =  Irt  + 

we  have 

»=1 

As  in  equation  (3.25),  we  can  write 

i  E 11/  -  /.II!,.  <  i  f;  ArV(4  -  <1) + f;  A,-'(1  -  4  =  JSi + Sj. 

”  »=1  »=1 

And  when  w  =  1//,  we  have 

Ar^ 


00 


Si  <  W  -  g  (J  _  „ 


Approximating  iSi  with  an  integral  results  in 


(1  +  9x^pf 


dx  = 


(1  -  a) 


We  evaluate 


g(i-g.i+A) 

■*1  2p  01+V2P 
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to  obtain 


Si=0 


The  second  sum  satisfies 


3^  2 


Ar^ci 


-1\2- 


Using  ||Qi,w/o||2;  <  00,  it  can  be  shown  by  dominated  convergence  that 
$2  =  0(a).  See  Silverman  [66].  Consequently, 


If  we  assume  the  additional  smoothness  condition  ||Qi,ti)/olli,«,  <  oo  on 
the  true  density,  we  can  apply  dominated  convergence  to  obtain  S2  — 
=  O  (a^)  and  in  turn 

E 11/  -  fo\\l,w  =  O  +  a^) . 

□ 


Proof  of  Theorem  3.12.  Apply  the  method  in  the  proof  of  Theorem  3.1. 

□ 

Proof  of  Theorem  3.13.  Apply  the  method  in  the  proof  of  Corollary  3.4. 

□ 


Proof  of  Theorem  3*14.  Similar  to  Theorem  3.5.  The  negative  penalized 
loglikelihood  for  a  Poisson  process  on  [0, 1]  is 


H9)  =  -[  loggdN+f  g  +  au{g) 

Jo  Jo 

for  a  suitable  penalty  functional  u.  To  evaluate  the  Gateaux  derivative,  note 
that  ^  ^ 

[  log{g  +  Xr)dN  =  f  -dN 
^^Jo  A=0  Jo  9 
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and 

^  /*  (^  +  Ar)  =  f  r. 

Jo  \=0  Jo 

Therefore,  a  penalized  loglikelihood  estimate  g  satisfies 

L’{g){r)  =  —  f  -dN+  [  r  +  ai/'(^)(r)  =  0,  Vr. 
Jo  9  Jo 

Now  consider  minimization  of  the  objective 


Differentiating,  we  have 


T>gT)r 


=  0,  Vr. 


At  a  fixed  point,  we  have  h  =  g,  so  that 


-J  ^dN  +  j  r  +  a  j 


"DgVr 


=  0,  Vr. 


Thus,  the  methods  produce  the  same  solution  if  u'{g){r)  =  J 


VgVr 
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4.  Practical  Nonparametric  AR  Estimation 

In  this  chapter,  we  consider  the  practical  computational  aspects  of  non¬ 
parametric  asymptotic  regression  estimation.  We  have  seen  that  there  are 
closed-form  solutions  in  certain  cases  for  the  nonparametric  AR  estimation 
problem.  In  general,  however,  this  is  not  possible.  So,  in  order  to  obtain 
numerical  results,  we  use  a  finite-dimensional  (discretized)  approximation  to 
the  infinite-dimensional  problem. 

In  section  4.1,  we  present  the  discretization  scheme  for  density  estima¬ 
tion  and  inverse  problems.  Since  density  estimation  techniques  apply  to 
the  -problem  of  intensity  estimation  for  completely  observed  Poisson  pro¬ 
cesses,  the  Poisson  process  problem  is  not  discussed  explicitly.  Section  4.2 
describes  a  data-driven  procedure  for  selection  of  the  smoothing  parameter. 
In  section  4.3,  by  means  of  a  Monte-Carlo  simulation  study,  we  compare 
nonparametric  AR  estimation  to  several  competitive  methods  for  solving  the 
deconvolution  problem. 

4.1  Discretization  Techniques 

4.1.1  Density  Estimation 

We  begin  by  recalling  the  nonparametric  (penalized)  AR  density  estimation 
problem  from  chapter  3.  That  is, 

minimize  J{f)  subject  to  /  G  IKp  n  ^  0,  //  =  l}  , 

where  the  objective  functional  is 

=  +  +  (4.1) 

q;  >  0  is  a  constant  called  the  smoothing  parameter,  h  is  a  p.d.f.,  D  is 
a  linear  differential  operator  of  order  p  ^  1  with  no  constant  term,  and 
Fn  is  the  empirical  c.d.f.  based  on  a  sample  X  =  {Xi,...  ,X„}  of  size  n 
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distributed  with  true  density  fg.  In  terms  of  the  weighted  L2  inner  product 
(^1  y)va  —  f  w  =  1/h  to  express  (4.1)  as 

j(/)  =  -{/,A>„+lii/iis,+fii»/iii 

=  5  (/.  P + “i®””®)/}*  -  (/".  />» 

=  1  (/,  (to  +  oC'rn®)/) 

where  Q  =  w  +  a'D*wT>  and  b„  =  wfn-  The  discrete  version  of  this  problem 
is 


minmize  —  bl^f  subject  to  /  ^  0  and  k*f  —  1,  (4.2) 

which,  of  course,  is  the  quadratic  programming  problem.  High-quality  soft¬ 
ware  for  solving  this  problem  is  readily  available. 

Note  that  the  solution  of  the  imconstrained  problem  / 

minimize  \fQf  —  b^f 

is  also  the  solution  of  the  system  of  linear  equations 

Qf  =  &«.  (4.3) 

In  some  applications,  the  constraints  can  be  ignored  and  the  estimator  can 
be  computed  as  the  solution  of  (4.3).  An  eeisier  computational  problem  does 
not  exist. 

At  an  intermediate  level  of  complexity,  we  have  the  equality-constrained 
problem 


minmize  \f*Qf  —  b*„f  subject  to  k*f  =  1.  (4.4) 

The  solution  to  (4.4)  integrates  to  1,  but  may  attain  negative  values.  How¬ 
ever,  in  a  reasonable  proportion  of  practical  situations,  the  solution  does 
indeed  satisfy  the  positivity  constraint.  If  necessary,  the  solution  can  be 
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truncated  and  renormalized  to  obtain  a  non-negative  density  estimate.  This 
observation,  coupled  with  the  complexity  of  (4,2)  relative  to  (4.4),  makes  it 
worthwhile  to  compute  (4,4),  check. for  positivity,  and  solve  the  quadratic 
program  only  when  necessary.  In  fact,  (4.4)  can  be  realized  by  solving  two 
linear  systems,  as  we  now  show. 

Let  b  =  bn.  The  Lagrange  multiplier  condition  for  (4.4)  is 

Qf-{b+Xk)  =  Q 


where  A  is  a  real  constant.  With  c  and  d  such  that  Qc  =  b  and  Qd  =  k,  we 
have  f  =  c  +  Xd.  Because  k'f  =  k'{c  -t-  Ad)  =  1,  we  have  A  =  (1  —  k'c)/{k'd) 
and 

\  —  kc 

^  k'd 


We  now  discuss  the  particular  form  of  the  (approximate)  discrete  rep¬ 
resentation.  Our  discretization  is  based  on  m  values.  Let  D  denote  the 
mxm  matrix  representer  of  the  differential  operator  D,  and  let  W  denote 
the  mxm  diagonal  weight  matrix  of  the  inner  product.  Here,  /„  is  a  dis¬ 
crete  representation  of  the  empirical  point  measure  (i.e.,  a  histogram  estimate 
with  m  bins),  and  6„  =  W fn-  The  quantities  g  and  k  are  m- vectors,  and 
Q  =  W  +  aD*WD. 

For  convenience,  we  take  the  domain  of  all  functions  to  be  [0, 1].  Par¬ 
tition  this  interval  into  m  subintervals  of  length  1/m.  Denote  by  the 
subinterval,  so 


1 1 

and  j 

,  / 

^  z  —  1  i  ' 

0,  — 

u  = 

) 

m  _ 

\ 

^  m  m  ^ 

i  €  {2,...  ,m}. 


Functions  are  constant  on  subintervals,  with  values  to  be  taken  at  the  mid¬ 
points  of  subintervals.  The  discrete  domain  values  are  then 

(xi, . . .  ,  Xm)  where  Xi  =  - — i  G  {1, . , .  , m}. 

m 

The  functions  h,  w  =  1/h,  and  f  are  represented  by  m-vectors.  I.e., 
h  —  {hi,...  ,hm),  where  hi  —  h{xi)  for  iG{l,  ...,m}. 
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and  likewise  for  w  and  /.  The  vector  k  is  the  representer  of  z  so 

fc  =  (fci,...  where  ki  =  —  for  iG{l, 

Tfl 

The  discrete  version  of  /„  is  a  histogram  density  estimator.  Specifically,  we 
have 

771 

/n  =  (/n.i,  •  •  • ,  /n,m),  where  fn,i  =  — |X  H  /i|  for  i  G  {1, . . .  ,  m}.  (4.5) 

7» 

The  diagonal  weight  matrix  W  has  entries  Wu  —  Wi.  Derivative  operators 
are  represented  by  difference  matrices.  Let  Dp  denote  the  representer  of 
z  z^K  The  first  difference  is 


1  0 
-1  1 

p  even 
p  odd, 

with  the  appropriate  corrections  for  boundary  conditions.  These  representa¬ 
tions  are  used  to  compute  solutions  for  any  of  the  problems  (4.2),  (4.3),  and 
(4.4). 

When  we  refer  to  a  smoothing  parameter  value  of  a,  we  are  actually  using 
a^,  where  p  is  the  order  of  the  penalty  operator.  This  makes  the  smoothing 
parameter  independent  of  the  penalty  order. 

The  recursive  sequence  of  AR  estimators  (/o,  /i,  A, .  • . )  is  computed  by 

A  ^  JO 

using  an  initial  value  of  /o,  and  then  setting  tt;  =  l//<  to  obtain  fi+i  as  the 
solution  of  (4.2),  (4.3),  or  (4.4),  as  appropriate. 


D\  =  m 


0 

1 

-1 


1 

-1 


and  higher-order  differences  are  given  by 


!),=  •! 


KiDp-i, 

Dp-iDi, 
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The  following  example  graphs  (Figures  4. 1-4.5)  illustrate  the  various  so¬ 
lution  options  for  AR  estimation.  We  used  S-PLUS  to  perform  most  of 
the  numerical  calculations.  An  IMSL  routine  was  used  to  solve  quadratic 
programs.  One  data  set,  the  “Buffalo  snowfall”  data,  was  used  in  all  the 
examples.  This  data  set  of  size  n  =  63  consists  of  annual  snowfall  amounts 
in  Buffalo,  New  York,  for  the  winters  of  1910/11  through  1972/73.  This  is  a 
classic  example  of  a  data  set  that  may  come  from  a  trimodal  distribution. 

Figure  4.1  shows  the  effect  of  variation  in  m,  the  discretization  grid  size. 
Values  used  were  m  =  10,  20,  50, 100,  250,  and  500.  These  are  single-step  AR 
estimates  with  the  penalty  operator  "Dz  —  z",  a  smoothing  parameter  value 
of  O'  =  d.OOl,  and  a  uniform  reference  distribution.  Solutions  were  computed 
using  problem  (4.4).  The  solutions  are  non-negative  everywhere,  so  they  are 
also  solutions  of  problem  (4.2).  The  graphs  in  Figure  4.1  explicitly  show  that 
the  approximate  solutions  are  piecewise  constant.  In  the  remaining  figures, 
estimate  values  at  the  midpoints  of  intervals  are  connected  with  straight  lines 
to  give  continuous  graphs. 

Figure  4.2  shows  the  effect  of  variation  in  a,  the  smoothing  parameter. 
Values  used  were  a  =  0.0002,  0.0005,  0.001,  0.002,  0.005,  and  0.001.  These 
are  single-step  AR  estimates  with  the  penalty  operator  T>z  =  z",  a  grid  size 
of  m  =  100,  and  a  uniform  reference  distribution.  Solutions  were  computed 
using  problem  (4.4). 

Figure  4.3  shows  the  effect  of  various  penalty  operator  orders.  Here  we 
used  Dz  =  z^^  for  p  =  1,  2,  3,  4,  5,  and  6.  These  are  single-step  AR 
estimates  with  a  uniform  reference  distribution,  a  grid  size  of  m  =  100,  and 
a  smoothing  parameter  value  of  a  =  0.001.  Solutions  were  computed  using 
problem  (4.2). 

Figures  4.4  and  4.5  display  recursive  ARE  sequences  with  five  iterations. 
In  all  cases,  we  used  a  uniform  initial  weight,  a  grid  size  of  m  =  100,  and  a 
smoothing  parameter  value  of  a  =  0.001.  In  Figure  4.4,  the  penalty  operator 
is  T>z  =  z",  and  Figure  4.5  depicts  Dz  =  z^^  with  p  =  1,  2,  3,  4,  5,  and  6. 
Solutions  were  computed  using  problem  (4.2). 
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4.1.2  Inverse  Problems 


Recall  the  problem  formulation  from  section  3.2.  The  random  variables 
Xi,...  are  i.i.d.  with  unknown  p.d.f.  /<,,  which  we  wish  to  estimate. 
Observed  data  are  Zi,...  ,  where  the  p.d.f.  of  the  Zi  is 

S.W  =  [3C/,JW 

for  some  known  operator  X.  Here,  we  assume  that  DC  is  a  linear  operator.  The 
empirical  c.d.f.  based  on  the  observable  data  is  Cr„,  and  the  corresponding 
empirical  point  measure  is  p„.  An  estimate  /  of  /„  is  a  solution  of  the  problem 

minimize  J{f)  subject  to  /  G  Dtp  n{/:/^0,  //  =  l}.  (4.6) 

The  objective  functional  is 

W)  =  -  (9»,3C/)„  +  1||3C/|£  +  f  IID/lli 

=  1  (/,  {OCwX  +  aVw-D)/}  -  f) 

=  Uf,Qf) -{<>.,/)■ 

Let  K  denote  the  m  x  m  matrix  representer  of  the  operator  DC.  The 
discrete  versions  of  problem  (4.6)  are  given  by  (4.2),  (4.3),  and  (4.4)  with 
Q  =  DC*u;DC  +  a'D*wV  and  bn  =  X*wfn.  The  only  difference  is  the  presence 
of  the  operator  DC,  so  formulation  of  a  linear  inverse  problem  requires  the 
additional  determination  of  K.  We  now  do  this  for  two  particular  problems. 

4.1.2.1  The  Deconvolution  Problem.  Recall  from  section  3.2.1  that 
the  convolution  operator  DC  is  defined  by 

g{t)  =  [Xf]{t)  =  J  k{t-  x)f{x)  dx, 

where  fc  is  a  known  p.d.f.  The  discrete  version  is 

m 

9j  =  5^ 

*=1 
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which  is,  in  terms  of  its  components, 


'  9i 

ko  k-i  k-2  ‘  • '  fc-m+l 

fl 
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ki  fco  k-i  k-m+2 

f2 
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^2  ki  kfj  fe—m+S 

•  •  • 

h 

-  9m  . 

1 

■  ^ 

CO 

1 

1 

1 

•  •ie 

_ i 

-  - 

In  the  discrete  representation  with  m  intervals  on  [0, 1],  we  take  /  constant 
on  subintervals  with  boundary  points  i/m,  where  i  G  {0, . . .  ,m}.  To  obtain 
the  correct  proportions  in  the  representer  K,  we  take  k  to  be  constant  on 
intervals  of  width  1/m  that  are  centered  at  the  points  i/m,  i  G  Z.  This  leads 
to  the  definition 

(»+l/2)/m 

ki=  j  k{t)  dt. 

(«-l/2)/m 


4. 1.2.2  The  Corpuscle  Problem.  Recall  the  formulation  of  the  corpus¬ 
cle  problem  from  section  3.2.2.  Spheres  with  random  radii  are  distributed 
at  random  uniformly  in  a  solid  medium.  The  sphere  radius  p.d.f.  is  /«, 
with  support  [0,  Rm]-  A  slice  through  the  medium  gives  data  that  are  circles 
(sphere-slice  intersections)  with  p.d.f.  Po.  The  functional  3C<„  which  describes 
the  relationship  between  and  go,  is  nonlinear.  Specifically, 


9o{t)  =  [OCofolit)  = 


So“ 


However,  we  can  define  a  new  function  /*  by  /♦(t)  =  fo{t)  j  x/o(x)  dx^ 
and  observe  that 


/Rm 

{x^  -  dx, 

where  3C  is  a  linear  operator.  Since  /*  is  simply  a  constant  multiple  of  the 
P-di.  /o,  which  we  wish  to  estimate,  we  can  recover  /<>  by  normalizing  /,; 
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i.e., 

/  (t)  = _ — 

To  compute  the  discrete  operator,  let  us  first  scale  [0,  Rm]  into  [0, 1]  by 
using  t  =  Rms  and  x  =  Rmz.  The  integral  then  becomes 

9{Rms)  =  RmS  -  s^)~'^^^f{RMz)dz. 

( 

The  domain  of  integration  proceeds  from  the  midpoint  of  an  interval  to  1; 
thus,  the  discrete  version  is 


di  =  Rm  • 


i  - 1/2 
m 


Since  / {z^  —  s^)  dz  =  log  \z  +  {z^  —  ,  we  have 

m 

g  =  Kf,  or  9i  =  RM'*-^'^Vijfj  for  iG  m}, 

j=i 

where 

f 0,  j  <i 


Vij  =  < 

log 

t-1/2 

log 

4.2  Selecting  the  Smoothing  Parameter 

Practically  speaking,  we  need  an  automatic  procedure  for  selecting  the 
smoothing  parameter.  Cross-validation  is  suited  to  least-squares  problems 
and  has  been  applied  to  spline  smoothing  and  other  statistical  estimation 
and  regression  problems.  Since  discrete  AR  estimation  is  a  smoothing  spline 
problem,  the  technique  of  cross-validation  is  directly  applicable.  We  use 
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generalized  cross-validation  (GCV)  to  select  the  AR  smoothing  parameter 
in  density  estimation  and  linear  inverse  problems.  For  background  informa¬ 
tion  on  cross-validation  and  GCV,  see  Wahba  [78]  and  [79],  Gu  [23],  and 
Hardle  [27], 


4.2.1  Density  Estimation 


The  discrete  representation  (4.3)  has  unconstrained  solution  g  =  Mafni 
where  Ma  —  (iV  -I-  otD*WD)~^W.  Let  Tr  denote  the  trace  operator,  and  let 
the  weighted  discrete  inner  product  (*,  be  given  by  {a,b)^r  = 

The  GCV  score  function 


C(a)  = 


is  an  estimator  of  mean-squared  error.  The  GCV  criterion  for  selection  of 
the  smoothing  parameter  a  is 


minimize  C{a). 

a 

Note  that  the  numerator  of  C(a)  is  simply  the  weighted  squared  deviation 
-  fn,i)^‘Wi,  where  (/n,i, . . .  ,/n,m)  is  the  histogram  density  estimator 

of  (4.5). 

For  an  illustrative  example,  we  take  a  sample  of  size  n  =  100  from  the 
beta  distribution  (defined  on  page  97)  with  density  P{‘,  3,  5)  and  compute 
the  GCV  score  for  a  range  of  a  values  (Figure  4.6).  Using  the  value  of  ct 
selected  by  the  GCV  criterion,  we  then  compute  the  second  iteration  of  the 
AR  sequence  (Figure  4.7). 


4.2.2  Inverse  Problems 

This  is  similar  to  the  standard  density  estimation  case.  The  discrete  rep¬ 
resentation  has  unconstrained  solution  /  =  Map„,  where  Mq  =  (K*WK  -I- 
aD*WD)~^K*W.  The  technique  of  GCV  can  be  adapted  to  the  linear  inverse 
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problem.  The  GCV  criterion  in  this  case  is 


minimize  C(a)  = 

Q 


['lT(I-KMa)f  ■ 


There  is  an  extra  K  in  the  score  because  is  an  estimator  of  /  and 

KMa9n  is  an  estimator  of  g.  Note  that  g  is  the  distribution  of  the  (observ¬ 
able)  data. 

The  numerator  of  C{a)  in  this  case  is  the  weighted  squared  deviation 
where  Kf  is  the  transformed  estimator  of  the  under¬ 
lying  unobservable  density,  hence  an  estimator  of  the  observable  data  density 
g.  Of  course,  (gn,u  •  •  •  >Pn,m)  is  the  histogram  density  estimator  of  the  ob¬ 
servable  data. 

For  an  example  of  GCV  in  deconvolution  estimation,  we  take  a  sample  of 
size  n  =  100  from  the  density  /3(*,  3,  5)  as  the  signal.  The  noise  distribution 
is  normal  with  a  standard  deviation  of  0.1  and  zero  mean.  Once  again,  we 
compute  the  GCV  score  for  a  range  of  a  values  (Figure  4.8).  Using  the  value 
of  a  selected  by  the  GCV  criterion,  we  then  compute  the  second  iteration  of 
the  AR  sequence  (Figure  4.9). 

For  an  example  of  GCV  in  corpuscle  problem  estimation,  we  take  a  sample 
of  size  n  =  250  from  density  /5(‘,  5, 3)  as  the  signal.  As  usual,  we  compute 
the  GCV  score  for  a  range  of  a  values  (Figure  4.10).  Using  the  a  selected  by 
the  GCV  criterion,  we  then  compute  the  second  iteration  of  the  AR  sequence 
(Figure  4.11). 


4.3  Simulation  Study:  Deconvolution 

In  this  section,  we  present  the  results  of  a  modest  simulation  study  of  the 
deconvolution  problem,  in  which  we  compare  the  the  AR  deconvolution  es¬ 
timator  to  the  NEMS  estimator  of  Eggermont  and  LaRiccia  [15]  and  the 
Fourier  kernel  method  studied  by  Stefanski  [69],  Fan  [18],  and  others. 

We  use  the  signal  and  noise  distributions  for  which  results  are  tabulated 
in  Eggermont  and  LaRiccia  [15]  and  compare  our  AR  results  to  theirs  for 
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the  NEMS  and  Fourier  estimators.  Optimzil  smoothing-parameter  selection 
is  possible  for  all  three  estimators,  since  we  know  the  true  signal  distribu¬ 
tions.  There  is  an  automatic  procedure  for  selecting  the  NEMS  smoothing 
parameter,  and  we  use  GCV  to  select  the  AR  smoothing  parameter.  All  three 
estimators  are  compared  for  optimal  selection  of  the  smoothing  parameter, 
and  the  NEMS  and  AR  estimators  are  compared  for  automatic  selection. 
The  basis  for  comparing  the  various  estimators  /  in  this  simulation  is  the  Ly 
error 

where  /  is  the  density  of  the  underlying  signal  (without  noise).  The  average 
Li  error  for  a  number  of  repetitions  is  taken  as  an  estimate  of  E  ||/  —  /||i. 

Average  Li  errors  for  various  signal  distributions  are  presented  in  Ta¬ 
bles  4.1  (with  normal  noise)  and  4.2  (with  uniform  noise)  for  a  sample  size 
of  100.  Sample  means  and  standard  deviations  of  Li  errors  for  a  range  of 
sample  sizes  are  presented  in  Table  4.3,  but  these  results  are  limited  to  two 
of  the  normal-noise  cases. 

For  the  simulation  study,  signal  and  noise  distributions  are  based  on  the 
normal  density 

<^(®)  = and  a)  = -<f)  (-)  , 

V27r  <T  V<T/ 

the  uniform  density 

u(x)  =  /[o,i](x)  and  u{x',  a)  =  , 

and  the  beta  density 


/3(x,  a,  6)  = 


- and  P{x,  o,  6;  a)  = 


The  six  signal  densities  we  use  in  the  simulation  are 
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(£>1)  f{x)  =  0.9</>{x  -  5;  0.5)  +  0.1(j){x  -  7;  0.25), 

(£>2)  f{x)  =  0.9(f>{x  -  5;  0.5)  +  O.^x  -  7;  0.5), 

(D3)  f(x)  =  0.5(f>{x  -  5.82;  0.57)  +  0.5^(x  -  4.18;  0.57), 

(D4)  f{x)  =  4>{x  -  5), 

{D5)  f{x)  =  u(x  —  3;  5),  and 

(I>6)  f(x)  =  /3(x  -  3,  2.4,  3.6;  5). 

Convolution  is  accomplished,  of  course,  by  adding  noise  to  the  signal.  Noise 
densities  used  in  the  simulation  are  normal  (with  various  standard  devia¬ 
tions)  and  uniform  on  [0, 1].  Combinations  of  signal  and  normal  noise  are 
referenced  as  N1-N8.  See  Table  4.1  for  the  particular  values.  The  correspond¬ 
ing  combinations  with  uniform  noise  (referenced  as  Ul,  etc.)  are  detailed  in 
Table  4.2. 

The  NEMS  and  Fourier  estimation  parameters  are  detailed  in  Eggermont 
and  LaRiccia  [15].  We  now  describe  the  AR  parameters  for  this  simulation. 

For  the  given  combinations  of  signal  and  noise,  an  estimation  domain  of 
[0, 10]  is  adequate  to  capture  most  of  the  data.  Observations  fall  outside  of 
this  region  only  rarely. 

The  AR  penalty  Vx  =  x"  is  used  in  all  cases  except  N5,  U5,  and  N7,  in 
which  'Dx  =  x'  is  used  for  reasons  of  numerical  stability. 

AR  estimators  are  calculated  using  model  (4.4)  of  page  88  with  an  itera¬ 
tion  count  of  i  =  2.  The  initial  weight  for  the  AR  procedure  is  the  constant 
function.  In  the  case  of  automatic  smoothing  parzimeter  selection,  the  result 
of  GCV  applied  to  the  first-step  AR  estimator  is  used  as  the  second-step 
smoothing  parameter,  in  addition  to  using  the  first-step  estimate  as  the  ref¬ 
erence  measure  in  the  second  iteration. 

Sample  sizes  are  n  =  100  in  Tables  4.1  and  4.2,  and  n  =50, 100,  250,  500, 
and  1000  in  Table  4.3.  A  simulation  repetition  count  of  iV  =  1000  is  used 
in  all  cases.  To  conserve  computation  time,  a  grid  size  of  m  =  100  is  used 
throughout. 
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The  S-PLUS  leinguage,  version  3.4,  is  used  for  programming  the  proce¬ 
dure,  along  with  a  few  utility  functions  written  in  the  C  language.  A  native 
S-PLUS  minimization  routine  is  used  in  both  the  optimal  and  GCV  smooth¬ 
ing  parameter  selection.  This  routine  searches  for  local  extrema  on  an  inter¬ 
val.  Reasonable  a-search  regions  are  selected  separately  for  each  case.  It  is 
interesting  to  note  that  the  GCV  search  is  more  efficient  than  the  optimal  oc 
search. 

The  computing  platform  used  is  a  Silicon  Graphics  SGI-4D/PCXL-8  with 
24  200-MHz  IP19  processors  and  1  GB  of  main  memory.  In  this  environment, 
the  simulation  runs  in  about  24  hours. 


4.3.1  Observations 

For  the  study  with  normal  noise  and  n  =  100  (Table  4.1),  the  NEMS  and  AR 
estimators  are  competitive,  whereas  the  Fourier  estimator  has  much  larger 
errors.  With  optimal  smoothing-parameter  selection,  the  NEMS  error  ex¬ 
ceeds  the  AR  error  in  six  out  of  eight  cases;  and  with  automatic  selection, 
the  AR  error  exceeds  the  NEMS  error  in  four  out  of  eight  cases.  The  Fourier 
estimator  has  much  higher  errors. 

The  NEMS  estimator  performs  slightly  better  than  the  AR  estimator  in 
the  case  of  uniform  noise  (Table  4.2).  With  optimal  smoothing-parameter 
selection,  the  NEMS  error  exceeds  the  AR  error  in  two  out  of  six  cases;  and 
with  automatic  selection,  the  AR  error  exceeds  the  NEMS  error  in  five  out 
of  six  cases. 

Several  interesting  observations  arise  firom  consideration  of  the  results  in 
Table  4.3.  Since  various  sample  sizes  are  tested  here,  the  data  provide  an 
empirical  indication  of  the  rates  of  L\  error  convergence  for  the  NEMS  and 
AR  estimators  in  the  cases  of  optimal  and  automatic  smoothing-parameter 
selection.  This  is  limited,  of  course,  to  the  two  signal  and  noise  combinations 
under  study  here. 
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Error  variances  are  comparable,  with  the  AR  estimator  having  marginally 
lower  dispersion  for  optimal  selection,  and  the  NEMS  estimator  having 
slightly  lower  dispersion  for  automatic  selection. 

See  Figures  4.12  and  4.13  for  graphical  presentations  of  the  mean  error 
rates.  Expected  Li  error  can  be  modeled  as 

^WS-fh^kn”, 

and  the  coefficients  k  and  p  can  be  obtained  from  the  Table  4.3  results  by 
linear  regression.  Estimates  of  the  coefficients  are  presented  in  Table  4.4. 

The  Fourier  estimator  has  higher  error  and  a  slower  rate  of  convergence. 
With  optimal  selection,  the  AR  estimator  has  lower  errors  than  the  NEMS 
estimator  for  both  distributions  tested.  Also,  the  AR  error  converges  at  a 
faster  rate. 

With  automatic  selection,  convergence  rates  of  the  AR  and  NEMS  es¬ 
timators  are  practically  the  same.  The  NEMS  error  is  slightly  lower  for 
distribution  N5,  and  the  AR  estimator  has  a  slightly  better  rate  for  N2. 

The  AR  estimator  (for  N2  and  N5)  and  the  NEMS  estimator  (for  N2) 
have  optimal  selection  error  convergence  rates  that  are  faster  than  the  cor¬ 
responding  automatic  rates,  as  would  be  expected.  The  small  differences  in 
slope  indicate  that  the  automatic-selection  procedures  give  estimators  that 
almost  achieve  the  optimal-selection  rate.  For  these  cases,  note  that  the  lines 
in  Figure  4.13  are  practically  parallel. 

In  conclusion,  asymptotic  regression  provides  a  reasonable  and  practical 
technique  for  deconvolution  estimation. 
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Table  4.1.  Mean  Li  Error  for  Fourier,  NEMS,  and  AR  Deconvolution  Esti¬ 
mators  with  Normal  <f>{‘ ;  a)  Noise,  n  =  100 


/ 

a 

optimal  a 
Fourier  NEMS 

AR 

automatic  a 

NEMS  AR 

N1 

{D2) 

.50 

.389 

.210 

.210 

.326 

.250 

N2 

(D2) 

.29 

.280 

.167 

.154 

.205 

.205 

N3 

(D3) 

.58 

.301 

.201 

.205 

.254 

.258 

N4 

(D4) 

.50 

.242 

.131 

.105 

.157 

.156 

N5 

(D4) 

.29 

.195 

.127 

.116 

.144 

.160 

N6 

(D5) 

.29 

.265 

.231 

.226 

.246 

.287 

N7 

(D6) 

.29 

.209 

.136 

.121 

.150 

.162 

N8 

(Dl) 

.29 

.298 

.181 

.168 

.225 

.213 

Table  4.2.  Mean  Li  Error  for  NEMS  and  AR  Deconvolution  Estimators 
with  Uniform  u{- ;  1)  Noise,  n  =  100 


/ 

optimal 
NEMS  AR 

automatic 

NEMS  AR 

U1 

{D2) 

.169 

.169 

.194 

.228 

U3 

{D3) 

.168 

.170 

.189 

.224 

U4 

(D4) 

.133 

.111 

.170 

.162 

U6 

(D5) 

.232 

.236 

.269 

.289 

U7 

(D6) 

.134 

.147 

.164 

.182 

U8 

(Dl) 

.181 

.177 

.209 

.231 
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Table  4.3.  Mean  and  Standard  Deviation  of  L\  Error  for  Fourier,  NEMS, 
and  AR  Deconvolution  Estimators  with  Normal  <j){-  ;  .29)  Noise,  n  =  50, 
100,  250,  500,  and  1000 


optimal  a 

automatic  a 

Fourier 

NEMS 

AR 

NEMS 

AR 

n 

X 

s 

X 

s 

X 

s 

X 

s 

•  X 

s 

N2 

50 

.312 

.079 

.217 

.077 

.202 

.078 

.254 

.083 

.268 

.108 

100 

.280 

.060 

.167 

.060 

.154 

.058 

.205 

.067 

.205 

.090 

250 

.244 

.045 

.120 

.042 

.108 

.038 

.147 

.048 

.144 

.067 

500 

.222 

.033 

.094 

.032 

.081 

.029 

.114 

.037 

.112 

.053 

1000 

.204 

.025 

.071 

.023 

.062 

.021 

.086 

.027 

.086 

.038 

N5 

50 

.227 

.072 

.163 

.076 

.154 

.063 

.196 

.082 

.208 

.092 

100 

.195 

.051 

.126 

.053 

.116 

.047 

.144 

.053 

.160 

.066 

250 

.163 

.036 

.093 

.036 

.079 

.030 

.103 

.036 

.112 

.041 

500 

.142 

.026 

.074 

.026 

.061 

.021 

.076 

.026 

.086 

.028 

1000 

.127 

.020 

.055 

.018 

.045 

.014 

.059 

.019 

.062 

.017 

Table  4.4.  Empirical  Li  Error  Rate  CoeflBcients  for  Fourier,  NEMS,  and 
AR  Deconvolution  Estimators  in  the  Model  E||  /  —  /||i  =  fc  n~^ 


optimal  a 

Fourier  NEMS  AR 

automatic  a 

NEMS  AR 

N2  k 

0.541  0.920  0.955 

1.069  1.176 

P 

0.143  0.369  0.396 

0.362  0.379 

N5  k 

0.481  0.655  0.768 

0.925  1.005 

P 

0.195  0.355  0.411 

0.400  0.400 
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Figure  4.1.  AR  Density  Estimates  with  Various  Discretization  Grid  Sizes, 
Buffalo  Snowfall  Data,  n  =  63. 
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Figure  4.2.  AR  Density  Estimates  with  Various  Smoothing  Parameter  Val 
ues,  Buffalo  Snowfall  Data,  n  =  63. 
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Figure  4.3.  AR  Density  Estimates  with  Various  Penalty  Functional  Orders, 
Buffalo  Snowfall  Data,  n  =  63. 
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Figure  4.5.  Recursive  AR  Density  Estimate  Sequences  for  Various  Penalty 
Functional  Orders,  Buffalo  Snowfall  Data,  n  =  63. 
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Figure  4.7.  AR-GCV  Density  Estimate,  ;9(-,  3, 5),  n  =  100. 


109 


Figure  4.8.  GCV  Score  and  AR  Deconvolution  Estimates,  y0(-,  3,5)  * 
(^(•;0.1),  n  =  100. 
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Figure  4.9.  AR-GCV  Deconvolution  Estimate,  j9(-,  3, 5)  ♦  0.1),  n  =  100. 
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Figure  4.11.  AR-GCV  Corpuscle  Estimate,  /3(-,5,3),  n  =  250. 
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Figure  4.12.  Empirical  Li  Error  for  Fourier,  NEMS,  and  AR  Deconvolution 
Estimators. 
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Figure  4.13.  Empirical  Li  Error  for  NEMS  and  AR  Deconvolution  Estima¬ 
tors. 
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Intentionally  Left  Blank. 
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Appendix.  Estimation  for  Gaussian 
Processes 


We  consider  random  variables  X  where  F  is  a  space  of  real-valued 

functions  on  some  set  I.  Typically  ICR,  and  here  we  take  I  =  [0, 1]. 

Definition  (Gaussian  Process).  The  stochastic  process  X(t)  is  said  to  be 
Gaussian  if  the  finite-dimensional  distributions  of  X  are  multivariate  normal; 
i.e.,  if  for  every  positive  integer  n  emd  vector  (ii,...  ,tn)  C  I”,  the  vector 
(X(ti), . . .  ,X(t„))  has  a  multivariate  normal  distribution. 

A  Gaussian  process  process  X{t)  =  m{t)  H-  n{t)  is  characterized  by  its 
mean  value  function  EX{t)  =  m{t)  and  covariance  function  K{s,t)  = 
Cov[X(s),  X(t)]  =  E  n{s)  n{t). 

Definition  (Reproducing  Kernel  Hilbert  Space).  A  Hilbert  space  Hk 
of  functions  on  I  with  inner  product  (/,  is  a  reproducing  kernel  Hilbert 
space  (RKHS)  if  for  each  i  €  I  the  point  evaluation  functional  Vt,  defined  by 
Vt{S)  =  fit)  for  all  /  €  IT,  is  continuous. 

In  a  RKHS,  each  point  evaluation  functioned  Vt  has  a  Riesz  representation 
Vt{f)  =  {Kt,f)  =  f(t)  for  a  unique  Kt  G  H.  The  function  K{s,t)  = 
Vt{Ks)  =  {Kt,  Kt)j^  =  Kg{t)  is  called  the  reproducing  kernel  of  H. 

The  linear  span  of  X  is  the  space 

L{X)  =  I^OiX(ti) :  n  G  N,  Of  G  R,  UGl 

which  is  an  inner  product  space  with  inner  product  {u,  v)  =  E{uv)  and  norm 
||u||  =  y/Eu^.  The  Hilbert  space  generated  by  X{t),  denoted  L2{X),  is  the 
II  •  ||-completion  of  L{X). 

Let  Hk  be  the  RKHS  with  reproducing  kernel  K{s,t)  =  E[X(s)A’(t)] 
and  inner  product  (•,  •)j^.  Let  (j) :  Hk  L^iX)  be  defined  by  <f>{Kt)  =  X{t). 
Then  (f>  has  the  properties 

E^(/)  =  (/,m)^  and  E<f>{f)(t>{g)  =  {f,g)K . 


117 


Note  that  {X(i)  :  t  G  J}  generates  L2(X),  and  {Kt  :  t  €  1}  generates 
fffc.  Furthermore,  ^  is  an  inner-product-preserving,  bijective  linear  transfor¬ 
mation  in  the  sense  that 

L2(X)  ~  {fi9)K' 

Now  let  X[t)  be  a  stocheustic  process  with  mean  value  function  E  = 
m{t)  and  known  covariance  function  K{s,t)  =  Cov[A’(t),Jf(s)],  and  let  Hk 
be  the  RKHS  with  reproducing  kernel  K.  In  this  case,  the  function  0  :  Hk 
^2{X)  given  by  (j>{Kt)  =  X{t)  has  the  properties 

E^(/)  =  (/,m)^  and  Cow[(t>{f),<l>{g)\=  {f,g)^ . 

Theorem  A.l.  Let  {-X’(t)  :  t  E  1}  be  a  stochastic  process  with  mean  value 
m{t)  =  EX{t)  and  covariance  K{s,t)  =  Cov[A’(s),X(t)],  where 

(1)  I  is  countable,  or 

(2)  I  is  separable,  K  is  continuous,  andX{t)  is  separable. 

Let  be  the  probability  measure  induced  by  X  on  the  space  of  sample  paths. 
Let  3^0  be  the  probability  induced  by  a  zero-mean  Gaussian  process  with  covari¬ 
ance  function  K .  Then  IPo  ond  are  orthogonal  if  m  ^  Hk  and  equivalent 
if  m  E  Hk)  in  which  case 

dT 

=  «q>  [,A(m)  -  l||m||y  . 

Proof.  See  Parzen  [51].  □ 

The  following  theorem  is  useful  in  that  it  gives  explicit  formulas  for  the 
RKHS  inner  products  that  we  need. 

Theorem  A.2.  Consider  the  RKHS  of  functions  on  [o,  6]  with  0  ^  a  <  b  ^ 
1,  where  the  reproducing  kernel  has  the  form 

K{s,t)  =  «(s  A  <)  u(5  V  t). 
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Letx{s,t)  =  u{syt)v{sf\t)—u{s/\t)v{s\/t),  andlety{i)  =  u\t)v{t)—u{t)v'{t). 
If  x{s,  t)  >  0  and  y{s)  >  0  for  all  s  and  t  ^  s  in  [o,  6],  then  the  corresponding 
RKHS  inner  product  is  given  by 


\\F\\i = r 

Ja 


[{F/vyf  .  F{a) 


+ 


(u/v)'  u[a)v{a)' 


Proof.  See  Sacks  and  Ylvisaker  [59]  for  a  derivation.  We  simply  verify  the 
reproducing  property.  Here,  /  means  /  ds. 


(K  W\  -  f' (Kt/vriF/vy  Ktia)Fia) 

'  *  Ja  iV‘/'^y  u{a)v{a) 

Ja  {'^hy  Jt  {ft-hy 

=  v{t)  f  {F/v)'  +  u{t)  •  0  +  t;(t)F(a)/t;(o) 

Ja 

=  v(t)[F(t)/v(t)  -  F(o)/u(o)]  +  t;(t)F(a)/v(o) 

= m- 


u{a)  v(t)  F{a) 
u{a)  v{a) 


□ 

Corollary  A.3.  Consider  the  RKHS  of  functions  on  [a,b]  with  0  ^  o  <  6  ^ 
1,  where  the  reproducing  kernel  has  the  form 

K{s,t)  =  u{s  At)v{sV  t)w{s)w{t). 


The  corresponding  RKHS  inner  product  is  given  by 

\\F\\l  =  \\FH\l 

Proof.  Note  that  w{s)w{t)  =  w{sAt)w{sWt).  Apply  the  theorem  to  K{s,  t)  = 
u{s  A  t)w(s  A  t)  ■  v{s  V  t)w{s  V  t);  i.e.,  replace  u  by  uw  and  v  by  vw  in  the 
form  II  •  112  to  obtain 

np|,2  ^  f'[{F/{vw))f  Fjaf 

Ja  {u/vy  u{a)v{a)w{aY' 

as  required.  □ 
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As  an  aid  in  performing  calculations,  we  can  write 


Wi;  =  / 

Ja  U((l)V(CZj 


where  J  =  [(F/t;)']^/(«/t;)'.  Since  (u/v)'  =  p/v^,  the  integrand  is 


^2  JP'\2  _ 

J  - - Z - 


y 


„2  ,  2F'Fv  -  F^v' 


=  1.  (f'^-v‘ 


) 


We  consider  several  important  examples  of  RKHS’s  with  this  type  of  inner 
product  structure. 


Example  (1).  Let  K{s,t)  =  G{s  A  i),  where  G  is  non-negative  and  in¬ 
creasing  on  I.  Then  u{t)  =  G{t)  and  v{t)  =  1.  Observe  that  x{s,t)  = 
G{s  V  t)  —  G{s  A  t)  >  0  if  s  7^  t,  y{t)  =  G'{t)  >  0  for  all  t,  and  v'  =  0.  So 


and  the  quadratic  form  is 


na? 

G{a)‘ 


(A.1) 


Example  (2).  Let  K{s,t)  =  G{s  At)  —  G{s)G{t),  where  G  is  non-negative 
and  increasing  on  1.  Then  u{t)  =  G{t)  and  v{t)  =  1  —  G(t).  Observe  that 


x{s,  t)  =  G{s  V  t)[l  —  G{s  A  t)]  —  G{s  A  t)[l  —  G(s  V  t)] 
=  G(s  V t)  —  G(s  At), 


so  X  >  0  if  s  ^  t.  Furthermore,  p(t)  =  G'(t)[l  -  G(t)]  +  G(t)G'(t)  =  G'(t). 
So  p  >  0  for  all  t,  and  u'  =  —p.  Thus, 


The  quadratic  form  is 


F{bf 
1  -  G{b) 


+ 


1 

G(a)[l-G(a)] 


1 

1  -  G{a) 


) 


or 


F{a) 

G{a) 


F{b)^ 

1-G{by 


(A.2) 


Example  (3).  Let  K{s,t)  =  G{s)G{t){s  At  —  st),  where  G  is  non-negative 
and  increasing  on  I.  Then  u{t)  =  t,  v{t)  =  1  —  t,  w{t)  =  G{t),  and  the 
quadratic  form  is 


\\F\f 


1 

-I-- 

a 


F(a)]^  1  [F{b) 

LgJo)]  1  -  fe  [(?(&) J 


t2 


(A.3) 


Intentionally  Left  Blank. 
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